Submitted by:
| # | Name | Id | Email |
|---|------|----|-------|
| Student 1 | Hay Elmaliah | 315777433 | hay.e@campus.technion.ac.il |
| Student 2 | Orad Barel | 311288203 | oradbarel@campus.technion.ac.il |
In this assignment we'll create a from-scratch implementation of two fundamental deep learning concepts: the backpropagation algorithm and stochastic gradient descent-based optimizers. In addition, you will create a general-purpose multilayer perceptron, the core building block of deep neural networks. We'll visualize decision boundaries and ROC curves in the context of binary classification. Following that, we will focus on convolutional networks with residual blocks. We'll create our own network architectures and train them using GPUs on the course servers, then conduct architecture experiments to determine the effects of different architectural decisions on the performance of deep networks.
In this part, we'll implement backpropagation and automatic differentiation from scratch and compare our implementations to PyTorch's built-in implementation (autograd).
import torch
import unittest
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
Reminder: The backpropagation algorithm is at the core of training deep models. To state the problem we'll tackle in this notebook, imagine we have an $L$-layer MLP model, defined as
$$ \hat{\vec{y}}^i = \vec{y}_L^i = \varphi_L \left( \mat{W}_L \varphi_{L-1} \left( \cdots \varphi_1 \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right) \cdots \right) \right), $$
and a loss function
$$ L(\vec{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\vec{y}^i, \hat{\vec{y}}^i) + R(\vec{\theta}), $$
where $\vec{\theta}$ is a vector containing all network parameters, e.g. $\vec{\theta} = \left[ \mat{W}_{1,:}, \vec{b}_1, \dots, \mat{W}_{L,:}, \vec{b}_L \right]$.
In order to train our model we would like to calculate the derivative (or gradient, in the multivariate case) of the loss with respect to each and every one of the parameters, i.e. $\pderiv{L}{\mat{W}_j}$ and $\pderiv{L}{\vec{b}_j}$ for all $j$. Since the gradient "points" in the direction of functional increase, the negative gradient is often used as a descent direction for descent-based optimization algorithms. In other words, iteratively updating each parameter proportionally to its negative gradient can lead to convergence to a local minimum of the loss function.
Calculus tells us that as long as we know the derivatives of all the functions "along the way" ($\varphi_i(\cdot),\ \ell(\cdot,\cdot),\ R(\cdot)$) we can use the chain rule to calculate the derivative of the loss with respect to any one of the parameter vectors. Note that if the loss $L(\vec{\theta})$ is scalar (which is usually the case), the gradient of a parameter will have the same shape as the parameter itself (matrix/vector/tensor of same dimensions).
For deep models that are a composition of many functions, calculating the gradient of each parameter by hand and implementing hard-coded gradient derivations quickly becomes infeasible. Additionally, such code makes models hard to change, since any change potentially requires re-derivation and re-implementation of the entire gradient function.
The backpropagation algorithm, which we saw in the lecture, provides us with an effective method of applying the chain rule recursively so that we can implement gradient calculations of arbitrarily deep or complex models.
We'll now implement backpropagation using a modular approach, which will allow us to chain many component layers together and get automatic gradient calculation of the output with respect to the input or any intermediate parameter.
To do this, we'll define a Layer class. Here's the API of this class:
import hw2.layers as layers
help(layers.Layer)
Help on class Layer in module hw2.layers:
class Layer(abc.ABC)
| A Layer is some computation element in a network architecture which
| supports automatic differentiation using forward and backward functions.
|
| Method resolution order:
| Layer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __call__(self, *args, **kwargs)
| Call self as a function.
|
| __init__(self)
| Initialize self. See help(type(self)) for accurate signature.
|
| __repr__(self)
| Return repr(self).
|
| backward(self, dout)
| Computes the backward pass of the layer, i.e. the gradient
| calculation of the final network output with respect to each of the
| parameters of the forward function.
| :param dout: The gradient of the network with respect to the
| output of this layer.
| :return: A tuple with the same number of elements as the parameters of
| the forward function. Each element will be the gradient of the
| network output with respect to that parameter.
|
| forward(self, *args, **kwargs)
| Computes the forward pass of the layer.
| :param args: The computation arguments (implementation specific).
| :return: The result of the computation.
|
| params(self)
| :return: Layer's trainable parameters and their gradients as a list
| of tuples, each tuple containing a tensor and it's corresponding
| gradient tensor.
|
| train(self, training_mode=True)
| Changes the mode of this layer between training and evaluation (test)
| mode. Some layers have different behaviour depending on mode.
| :param training_mode: True: set the model in training mode. False: set
| evaluation mode.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'backward', 'forward', 'params'})
In other words, a Layer can be anything: a network layer, an activation function, a loss function, or generally any computation for which we know how to derive a gradient.
Each Layer must define a forward() function and a backward() function.
- The forward() function performs the actual calculation/operation of the block and returns an output.
- The backward() function computes the gradient of the input and parameters as a function of the gradient of the output, according to the chain rule.

Here's a diagram illustrating the above explanation:
Note that the diagram doesn't show that if the function is parametrized, i.e. $f(\vec{x},\vec{y})=f(\vec{x},\vec{y};\vec{w})$, there are also gradients to calculate for the parameters $\vec{w}$.
The forward pass is straightforward: just do the computation. To understand the backward pass, imagine that there's some "downstream" loss function $L(\vec{\theta})$ and magically somehow we are told the gradient of that loss with respect to the output $\vec{z}$ of our block, i.e. $\pderiv{L}{\vec{z}}$.
Now, since we know how to calculate the derivative of $f(\vec{x},\vec{y};\vec{w})$, it means we know how to calculate $\pderiv{\vec{z}}{\vec{x}}$, $\pderiv{\vec{z}}{\vec{y}}$ and $\pderiv{\vec{z}}{\vec{w}}$ . Thanks to the chain rule, this is all we need to calculate the gradients of the loss w.r.t. the input and parameters:
$$ \begin{align} \pderiv{L}{\vec{x}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{x}}\\ \pderiv{L}{\vec{y}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{y}}\\ \pderiv{L}{\vec{w}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}} \end{align} $$

PyTorch has the nn.Module base class, which may seem similar to our Layer since it also represents a computation element in a network.
However, PyTorch's nn.Modules don't compute the gradient directly; they only define the forward calculations.
Instead, PyTorch has a more low-level API for defining a function and explicitly implementing its forward() and backward(). See autograd.Function.
When an operation is performed on a tensor, it creates a Function instance which performs the operation and stores any necessary information for calculating the gradient later on. Additionally, each Function points to the other Function objects representing the operations performed earlier on the tensor. Thus, a graph (or DAG) of operations is created (this is not 100% exact, as the graph is actually composed of a different type of class which wraps the backward method, but it's accurate enough for our purposes).
A Tensor instance which was created by performing operations on one or more tensors with requires_grad=True has a grad_fn property, which is a Function instance representing the last operation performed to produce this tensor.
This exposes the graph of Function instances, each with its own backward() function. Therefore, in PyTorch the backward() function is called on tensors, not on modules.
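A quick illustration of this mechanism (the exact grad_fn class names are internal PyTorch details and may vary between versions):

```python
import torch

x = torch.randn(3, requires_grad=True)
z = (2 * x).sum()

# z was produced by an operation, so it carries a grad_fn
print(z.grad_fn)   # something like <SumBackward0 object at ...>

z.backward()       # walks the graph of Function objects backwards
print(x.grad)      # dz/dx = 2 for every element
```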
Our Layers are therefore a combination of the ideas in Module and Function and we'll implement them together,
just to make things simpler.
Our goal here is to create a "poor man's autograd": We'll use PyTorch tensors,
but we'll calculate and store the gradients in our Layers (or return them).
The gradients we'll calculate are of the entire block, not individual operations on tensors.
To test our implementation, we'll use PyTorch's autograd.
Note that of course this method of tracking gradients is much more limited than what PyTorch offers. However it allows us to implement the backpropagation algorithm very simply and really see how it works.
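To make the approach concrete, here is a minimal sketch of the idea — not the hw2/layers.py API, the class and method names here are just for illustration — showing a layer that caches what it needs during forward() and returns the gradient from backward():

```python
import torch

class ExpLayer:
    """Illustrative layer computing z = exp(x); caches the output for backward."""
    def forward(self, x):
        self._z = torch.exp(x)   # cache: d exp(x)/dx = exp(x) = z
        return self._z

    def backward(self, dout):
        # VJP: dL/dx = dL/dz * dz/dx, computed elementwise
        return dout * self._z

# sanity check against PyTorch's autograd
x = torch.randn(4, dtype=torch.double, requires_grad=True)
layer = ExpLayer()
z = layer.forward(x.detach())
dout = torch.randn_like(z)          # pretend this came from downstream
torch.exp(x).backward(dout)         # autograd's gradient
assert torch.allclose(layer.backward(dout), x.grad)
```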
Let's set up some testing instrumentation:
from hw2.grad_compare import compare_layer_to_torch
def test_block_grad(block: layers.Layer, x, y=None, delta=1e-3):
diffs = compare_layer_to_torch(block, x, y)
# Assert diff values
for diff in diffs:
test.assertLess(diff, delta)
# Show the compare function
compare_layer_to_torch??
Notes:

- The compare_layer_to_torch() function will help you understand what PyTorch is doing.
- The delta above should not be needed; a correct implementation will give you a diff of exactly zero.

We'll now implement some Layers that will enable us to later build an MLP model of arbitrary depth, complete with automatic differentiation.
For each block, you'll first implement the forward() function.
Then, you will calculate the derivative of the block by hand with respect to each of its
input tensors and each of its parameter tensors (if any).
Using your manually-calculated derivation, you can then implement the backward() function.
Notice that we have intermediate Jacobians that are potentially high dimensional tensors. For example in the expression $\pderiv{L}{\vec{w}} = \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$, the term $\pderiv{\vec{z}}{\vec{w}}$ is a 4D Jacobian if both $\vec{z}$ and $\vec{w}$ are 2D matrices.
In order to implement the backpropagation algorithm efficiently, we need to implement every backward function without explicitly constructing this Jacobian. Instead, we're interested in directly calculating the vector-Jacobian product (VJP) $\pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$. In order to do this, you should try to figure out the gradient of the loss with respect to one element, e.g. $\pderiv{L}{\vec{w}_{1,1}}$ and extrapolate from there how to directly obtain the VJP.
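As a first example of this, for any elementwise function the Jacobian $\pderiv{\vec{z}}{\vec{x}}$ is diagonal, so the VJP collapses to an elementwise product and the full $N \times D \times N \times D$ Jacobian is never materialized; a sketch, checked against autograd:

```python
import torch

x = torch.randn(64, 1024, dtype=torch.double, requires_grad=True)
z = torch.relu(x)
dout = torch.randn_like(z)            # pretend this is dL/dz from downstream

# VJP directly: dL/dx = dL/dz * dz/dx, where dz/dx is 0 or 1 elementwise
dx_manual = dout * (x > 0).to(x.dtype)

z.backward(dout)                      # autograd's VJP for comparison
assert torch.allclose(dx_manual, x.grad)
```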
ReLU, or rectified linear unit, is a very common activation function in deep learning architectures. In its most standard form, as we'll implement here, it has no parameters.
We'll first implement the "leaky" version, defined as
$$ \mathrm{relu}(\vec{x}) = \max(\alpha\vec{x},\vec{x}), \ 0\leq\alpha<1. $$

This is similar to the ReLU activation we've seen in class, except that it has a small non-zero slope when its input is negative. Note that it's not strictly differentiable at zero; however, it has sub-gradients, defined separately for positive-valued and for negative-valued inputs.
TODO: Complete the implementation of the LeakyReLU class in the hw2/layers.py module.
N = 100
in_features = 200
num_classes = 10
eps = 1e-6
# Test LeakyReLU
alpha = 0.1
lrelu = layers.LeakyReLU(alpha=alpha)
x_test = torch.randn(N, in_features)
# Test forward pass
z = lrelu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.nn.LeakyReLU(alpha)(x_test), atol=eps))
# Test backward pass
test_block_grad(lrelu, x_test)
Comparing gradients... input diff=0.000
Now using the LeakyReLU, we can trivially define a regular ReLU block as a special case.
TODO: Complete the implementation of the ReLU class in the hw2/layers.py module.
# Test ReLU
relu = layers.ReLU()
x_test = torch.randn(N, in_features)
# Test forward pass
z = relu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.relu(x_test), atol=eps))
# Test backward pass
test_block_grad(relu, x_test)
Comparing gradients... input diff=0.000
The sigmoid function $\sigma(x)$ is also sometimes used as an activation function. We have also seen it previously in the context of logistic regression.
The sigmoid function is defined as
$$ \sigma(\vec{x}) = \frac{1}{1+\exp(-\vec{x})}. $$

# Test Sigmoid
sigmoid = layers.Sigmoid()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = sigmoid(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.sigmoid(x_test), atol=eps))
# Test backward pass
test_block_grad(sigmoid, x_test)
Comparing gradients... input diff=0.000
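When implementing the backward passes of Sigmoid (and TanH below), it helps that both derivatives can be written in terms of the cached forward output: $\sigma'(x)=\sigma(x)(1-\sigma(x))$ and $\tanh'(x)=1-\tanh^2(x)$. A quick autograd check of both identities:

```python
import torch

x = torch.randn(5, dtype=torch.double, requires_grad=True)

s = torch.sigmoid(x)
s.sum().backward()
assert torch.allclose(x.grad, (s * (1 - s)).detach())  # sigma' = sigma(1-sigma)

x.grad = None                                          # reset before the next check
t = torch.tanh(x)
t.sum().backward()
assert torch.allclose(x.grad, (1 - t**2).detach())     # tanh' = 1 - tanh^2
```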
The hyperbolic tangent function $\tanh(x)$ is a common activation function used when the output should be in the range [-1, 1].
The tanh function is defined as
$$ \tanh(\vec{x}) = \frac{\exp(\vec{x})-\exp(-\vec{x})}{\exp(\vec{x})+\exp(-\vec{x})}. $$

# Test TanH
tanh = layers.TanH()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = tanh(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.tanh(x_test), atol=eps))
# Test backward pass
test_block_grad(tanh, x_test)
Comparing gradients... input diff=0.000
First, we'll implement an affine transform layer, also known as a fully connected layer.
Given an input $\mat{X}$ the layer computes,
$$ \mat{Z} = \mat{X} \mattr{W} + \vec{b} ,~ \mat{X}\in\set{R}^{N\times D_{\mathrm{in}}},~ \mat{W}\in\set{R}^{D_{\mathrm{out}}\times D_{\mathrm{in}}},~ \vec{b}\in\set{R}^{D_{\mathrm{out}}}. $$

Notes:
TODO: Complete the implementation of the Linear class in the hw2/layers.py module.
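The backward pass you'll derive can be checked against autograd. Under the convention above ($\mat{Z} = \mat{X}\mattr{W} + \vec{b}$), the VJPs come out as plain matrix products, with no 4D Jacobian ever built:

```python
import torch

N, D_in, D_out = 8, 5, 3
X = torch.randn(N, D_in, dtype=torch.double, requires_grad=True)
W = torch.randn(D_out, D_in, dtype=torch.double, requires_grad=True)
b = torch.randn(D_out, dtype=torch.double, requires_grad=True)

Z = X @ W.t() + b
dout = torch.randn_like(Z)   # pretend this is dL/dZ from downstream
Z.backward(dout)

assert torch.allclose(X.grad, dout @ W)          # dL/dX
assert torch.allclose(W.grad, dout.t() @ X)      # dL/dW
assert torch.allclose(b.grad, dout.sum(dim=0))   # dL/db: sum over the batch
```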
# Test Linear
out_features = 1000
fc = layers.Linear(in_features, out_features)
x_test = torch.randn(N, in_features)
# Test forward pass
z = fc(x_test)
test.assertSequenceEqual(z.shape, [N, out_features])
torch_fc = torch.nn.Linear(in_features, out_features, bias=True)
torch_fc.weight = torch.nn.Parameter(fc.w)
torch_fc.bias = torch.nn.Parameter(fc.b)
test.assertTrue(torch.allclose(torch_fc(x_test), z, atol=eps))
# Test backward pass
test_block_grad(fc, x_test)
# Test second backward pass
x_test = torch.randn(N, in_features)
z = fc(x_test)
z = fc(x_test)
test_block_grad(fc, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000
As you know by now, cross-entropy is a common loss function for classification tasks. In class, we defined it as
$$\ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) = - {\vectr{y}} \log(\hat{\vec{y}})$$

where $\hat{\vec{y}} = \mathrm{softmax}(\vec{x})$ is a probability vector (the output of softmax on the class scores $\vec{x}$) and the vector $\vec{y}$ is a 1-hot encoded class label.
However, it's tricky to compute the gradient of softmax, so instead we'll define a version of cross-entropy that produces the exact same output but works directly on the class scores $\vec{x}$.
We can write, $$\begin{align} \ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) &= - {\vectr{y}} \log(\hat{\vec{y}}) = - {\vectr{y}} \log\left(\mathrm{softmax}(\vec{x})\right) \\ &= - {\vectr{y}} \log\left(\frac{e^{\vec{x}}}{\sum_k e^{x_k}}\right) \\ &= - \log\left(\frac{e^{x_y}}{\sum_k e^{x_k}}\right) \\ &= - \left(\log\left(e^{x_y}\right) - \log\left(\sum_k e^{x_k}\right)\right)\\ &= - x_y + \log\left(\sum_k e^{x_k}\right) \end{align}$$
Where the scalar $y$ is the correct class label, so $x_y$ is the correct class score.
Note that this version of cross entropy is also what's provided by PyTorch's nn module.
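A useful consequence of the derivation above is that the gradient of the cross-entropy w.r.t. the class scores is simply $\mathrm{softmax}(\vec{x}) - \vec{y}$ (scaled by $1/N$ when averaging over a batch), which can be confirmed against PyTorch:

```python
import torch

N, C = 4, 3
scores = torch.randn(N, C, dtype=torch.double, requires_grad=True)
labels = torch.randint(low=0, high=C, size=(N,))

loss = torch.nn.functional.cross_entropy(scores, labels)  # mean over the batch
loss.backward()

# analytic gradient: (softmax(x) - onehot(y)) / N
p = torch.softmax(scores.detach(), dim=1)
p[torch.arange(N), labels] -= 1.0
assert torch.allclose(scores.grad, p / N)
```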
TODO: Complete the implementation of the CrossEntropyLoss class in the hw2/layers.py module.
# Test CrossEntropy
cross_entropy = layers.CrossEntropyLoss()
scores = torch.randn(N, num_classes)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
# Test forward pass
loss = cross_entropy(scores, labels)
expected_loss = torch.nn.functional.cross_entropy(scores, labels)
test.assertLess(torch.abs(expected_loss-loss).item(), 1e-5)
print('loss=', loss.item())
# Test backward pass
test_block_grad(cross_entropy, scores, y=labels)
loss= 2.7283618450164795 Comparing gradients... input diff=0.000
Now that we have some working Layers, we can build an MLP model of arbitrary depth and compute end-to-end gradients.
First, let's copy an idea from PyTorch and implement our own version of the nn.Sequential Module.
This is a Layer which contains other Layers and calls them in sequence. We'll use this to build our MLP model.
TODO: Complete the implementation of the Sequential class in the hw2/layers.py module.
# Test Sequential
# Let's create a long sequence of layers and see
# whether we can compute end-to-end gradients of the whole thing.
seq = layers.Sequential(
layers.Linear(in_features, 100),
layers.Linear(100, 200),
layers.Linear(200, 100),
layers.ReLU(),
layers.Linear(100, 500),
layers.LeakyReLU(alpha=0.01),
layers.Linear(500, 200),
layers.ReLU(),
layers.Linear(200, 500),
layers.LeakyReLU(alpha=0.1),
layers.Linear(500, 1),
layers.Sigmoid(),
)
x_test = torch.randn(N, in_features)
# Test forward pass
z = seq(x_test)
test.assertSequenceEqual(z.shape, [N, 1])
# Test backward pass
test_block_grad(seq, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 param#09 diff=0.000 param#10 diff=0.000 param#11 diff=0.000 param#12 diff=0.000 param#13 diff=0.000 param#14 diff=0.000
Now, equipped with a Sequential, all we have to do is create an MLP architecture.
We'll define our MLP with the following hyperparameters:

- $D$: the number of input features,
- $h_1, \dots, h_L$: the number of output features of each hidden layer,
- $C$: the number of output classes.

So the architecture will be:
FC($D$, $h_1$) $\rightarrow$ ReLU $\rightarrow$ FC($h_1$, $h_2$) $\rightarrow$ ReLU $\rightarrow$ $\cdots$ $\rightarrow$ FC($h_{L-1}$, $h_L$) $\rightarrow$ ReLU $\rightarrow$ FC($h_{L}$, $C$)
We'll also create a sequence of the above MLP and a cross-entropy loss, since it's the gradient of the loss that we need in order to train a model.
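For illustration only (using torch.nn modules rather than our hw2 layers, and a hypothetical helper name), the repeating FC $\rightarrow$ ReLU pattern can be generated with a loop over consecutive dimensions:

```python
import torch.nn as nn

def make_mlp(in_features, hidden_features, num_classes):
    dims = [in_features, *hidden_features]
    blocks = []
    # one (Linear, ReLU) pair per hidden layer
    for d_in, d_out in zip(dims, dims[1:]):
        blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
    blocks.append(nn.Linear(dims[-1], num_classes))  # no activation on the output
    return nn.Sequential(*blocks)

mlp = make_mlp(200, [100, 50, 100], 10)
print(len(mlp))  # 3 hidden (Linear, ReLU) pairs plus the output Linear = 7 modules
```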
TODO: Complete the implementation of the MLP class in the hw2/layers.py module. Ignore the dropout parameter for now.
# Create an MLP model
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100])
print(mlp)
MLP, Sequential [0] Linear(self.in_features=200, self.out_features=100) [1] ReLU [2] Linear(self.in_features=100, self.out_features=50) [3] ReLU [4] Linear(self.in_features=50, self.out_features=100) [5] ReLU [6] Linear(self.in_features=100, self.out_features=10)
# Test MLP architecture
N = 100
in_features = 10
num_classes = 10
for activation in ('relu', 'sigmoid'):
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100], activation=activation)
test.assertEqual(len(mlp.sequence), 7)
num_linear = 0
for b1, b2 in zip(mlp.sequence, mlp.sequence[1:]):
if (str(b2).lower() == activation):
test.assertTrue(str(b1).startswith('Linear'))
num_linear += 1
test.assertTrue(str(mlp.sequence[-1]).startswith('Linear'))
test.assertEqual(num_linear, 3)
# Test MLP gradients
# Test forward pass
x_test = torch.randn(N, in_features)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
z = mlp(x_test)
test.assertSequenceEqual(z.shape, [N, num_classes])
# Create a sequence of MLPs and CE loss
seq_mlp = layers.Sequential(mlp, layers.CrossEntropyLoss())
loss = seq_mlp(x_test, y=labels)
test.assertEqual(loss.dim(), 0)
print(f'MLP loss={loss}, activation={activation}')
# Test backward pass
test_block_grad(seq_mlp, x_test, y=labels)
MLP loss=2.30924391746521, activation=relu Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 MLP loss=2.3934404850006104, activation=sigmoid Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
If the above tests passed then congratulations - you've now implemented an arbitrarily deep model and loss function with end-to-end automatic differentiation!
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Suppose we have a linear (i.e. fully-connected) layer with a weight tensor $\mat{W}$, defined with in_features=1024 and out_features=512. We apply this layer to an input tensor $\mat{X}$ containing a batch of N=64 samples. The output of the layer is denoted as $\mat{Y}$.
Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{X}}$ of the output of the layer w.r.t. the input $\mat{X}$.
Consider the Jacobian tensor $\pderiv{\mat{Y}}{\mat{W}}$ of the output of the layer w.r.t. the layer weights $\mat{W}$. Answer questions A-C about it as well.
display_answer(hw2.answers.part1_q1)
1.A:
With the following tensors as described in the question: $$ {Y}\in{R}^{64\times 512} $$ $$ {X}\in{R}^{64\times 1024} $$ We get by calculus that: $$ \frac{\partial Y}{\partial{X}} $$ is of size: $$ 64 \times 512 \times 64 \times 1024 $$
1.B:
The Jacobian is sparse: only the entries $\frac{\partial Y_{i,l}}{\partial{X}_{i,k}}$ (with matching sample index $i$) are non-zero. That is since each output row depends only on the input row of the same sample.
1.C:
We don't need to materialize the above Jacobian in order to calculate the downstream gradient w.r.t. the input ($\delta{X}$). Since we have the gradient of the loss with respect to the output, denoted as $\delta{Y}=\frac{\partial L}{\partial{Y}}$, using the chain rule we get: $$\delta{X}=\frac{\partial L}{\partial{X}} = \frac{\partial L}{\partial{Y}}\cdot W = \delta{Y}\cdot W$$ (shapes: $(64\times 512)\cdot(512\times 1024) = 64\times 1024$).
2.A:
With the following tensors as described in the question: $$ {Y}\in{R}^{64\times 512} $$ $$ {W}\in{R}^{512\times 1024} $$ We get by calculus that: $$ \frac{\partial Y}{\partial{W}} $$ is of size: $$ 64 \times 512 \times 512 \times 1024 $$
2.B:
Same as above, the Jacobian is sparse: only the entries $\frac{\partial Y_{i,l}}{\partial{W}_{l,k}}$ are non-zero. That is since each output element $Y_{i,l}$ is a linear combination of the $l$-th row of $W$ (with the $i$-th row of $X$ as coefficients).
2.C:
Same as above - we don't need to materialize this Jacobian in order to calculate the downstream gradient w.r.t. the weights. We again use the chain rule and get: $$\delta{W}=\frac{\partial L}{\partial{W}} = \left(\frac{\partial L}{\partial{Y}}\right)^{T}\cdot X = \delta{Y}^{T}\cdot X$$ (shapes: $(512\times 64)\cdot(64\times 1024) = 512\times 1024$).
Is back-propagation required in order to train neural networks with descent-based optimization? Why or why not?
display_answer(hw2.answers.part1_q2)
Your answer:
Back-propagation is not required in order to train neural networks with gradient-based optimization. Alternative methods exist, such as derivative-free optimization techniques like the Nelder-Mead method, and in principle the gradients could also be computed without the chain-rule recursion (e.g. symbolically or by numerical finite-difference approximation). However, these approaches are generally impractical and inefficient for deep models. Back-propagation is a specific algorithm that efficiently calculates exact gradients by leveraging the chain rule over the computational graph, allowing errors to be propagated through the network and the weights to be adjusted based on these gradients, which is why it is the standard choice in practice.
In this part we will learn how to implement optimization algorithms for deep networks. Additionally, we'll learn how to write training loops and implement a modular model trainer. We'll use our optimizers and training code to test a few configurations for classifying images with an MLP model.
import os
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
In the context of deep learning, an optimization algorithm is some method of iteratively updating model parameters so that the loss converges toward some local minimum (which we hope will be good enough).
Gradient descent-based methods are by far the most popular algorithms for optimization of neural network parameters. However the high-dimensional loss-surfaces we encounter in deep learning applications are highly non-convex. They may be riddled with local minima, saddle points, large plateaus and a host of very challenging "terrain" for gradient-based optimization. This gave rise to many different methods of performing the parameter updates based on the loss gradients, aiming to tackle these optimization challenges.
The most basic gradient-based update rule can be written as,
$$ \vec{\theta} \leftarrow \vec{\theta} - \eta \nabla_{\vec{\theta}} L(\vec{\theta}; \mathcal{D}) $$

where $\mathcal{D} = \left\{ (\vec{x}^i, \vec{y}^i) \right\}_{i=1}^{M}$ is our training dataset or part of it. Specifically, if we have in total $N$ training samples, then:

- $M = N$ corresponds to regular (full-batch) gradient descent,
- $M = 1$ corresponds to stochastic gradient descent (SGD),
- $1 < M < N$ corresponds to the commonly-used minibatch SGD.
The intuition behind gradient descent is simple: since the gradient of a multivariate function points to the direction of steepest ascent ("uphill"), we move in the opposite direction. A small step size $\eta$ known as the learning rate is required since the gradient can only serve as a first-order linear approximation of the function's behaviour at $\vec{\theta}$ (recall e.g. the Taylor expansion). However in truth our loss surface generally has nontrivial curvature caused by a high order nonlinear dependency on $\vec{\theta}$. Thus taking a large step in the direction of the gradient is actually just as likely to increase the function value.
The idea behind the stochastic versions is that by constantly changing the samples we compute the loss with, we get a dynamic error surface, i.e. it's different for each set of training samples. This is thought to generally improve the optimization since it may help the optimizer get out of flat regions or sharp local minima since these features may disappear in the loss surface of subsequent batches. The image below illustrates this. The different lines are different 1-dimensional losses for different training set-samples.
Deep learning frameworks generally provide implementations of various gradient-based optimization algorithms.
Here we'll implement our own optimization module from scratch, this time keeping a similar API to the PyTorch optim package.
We define a base Optimizer class. An optimizer holds a set of parameter tensors (these are the trainable parameters of some model) and maintains internal state. It may be used as follows:
- The zero_grad() function is invoked to clear the parameter gradients computed by previous iterations.
- The step() function is invoked in order to update the value of each parameter based on its gradient.

The exact method of update is implementation-specific for each optimizer and may depend on its internal state. In addition, adding the regularization penalty to the gradient is handled by the optimizer, since it only depends on the parameter values (and not the data).
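This is the same usage pattern as PyTorch's own optim package; for comparison, here is the equivalent loop with a small torch.nn model (our Optimizer is meant to be called the same way):

```python
import torch

torch.manual_seed(42)
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8, 2)

losses = []
for _ in range(10):
    opt.zero_grad()                                      # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                      # compute gradients
    opt.step()                                           # update parameters
    losses.append(loss.item())
```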
Here's the API of our Optimizer:
import hw2.optimizers as optimizers
help(optimizers.Optimizer)
Help on class Optimizer in module hw2.optimizers:
class Optimizer(abc.ABC)
| Optimizer(params)
|
| Base class for optimizers.
|
| Method resolution order:
| Optimizer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, params)
| :param params: A sequence of model parameters to optimize. Can be a
| list of (param,grad) tuples as returned by the Layers, or a list of
| pytorch tensors in which case the grad will be taken from them.
|
| step(self)
| Updates all the registered parameter values based on their gradients.
|
| zero_grad(self)
| Sets the gradient of the optimized parameters to zero (in place).
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| params
| :return: A sequence of parameter tuples, each tuple containing
| (param_data, param_grad). The data should be updated in-place
| according to the grad.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'step'})
Let's start by implementing the simplest gradient-based optimizer. The update rule will be exactly as stated above, but we'll also add an L2-regularization term to the gradient. Remember that in the loss function, the L2 regularization term is expressed by
$$R(\vec{\theta}) = \frac{1}{2}\lambda||\vec{\theta}||^2_2.$$

TODO: Complete the implementation of the VanillaSGD class in the hw2/optimizers.py module.
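Schematically (this is not the actual VanillaSGD implementation you'll write, just a sketch of the update with the L2 term folded into the gradient):

```python
import torch

def vanilla_sgd_step(params, learn_rate, reg):
    """params: iterable of (param, grad) tuples; parameters updated in-place."""
    for p, dp in params:
        # the L2 regularization term contributes reg * p to the gradient
        p -= learn_rate * (dp + reg * p)

p = torch.ones(3)
dp = torch.ones(3)
vanilla_sgd_step([(p, dp)], learn_rate=0.5, reg=0.1)
print(p)  # each entry: 1 - 0.5 * (1 + 0.1 * 1) = 0.45
```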
# Test VanillaSGD
torch.manual_seed(42)
p = torch.randn(500, 10)
dp = torch.randn(*p.shape)*2
params = [(p, dp)]
vsgd = optimizers.VanillaSGD(params, learn_rate=0.5, reg=0.1)
vsgd.step()
expected_p = torch.load('tests/assets/expected_vsgd.pt')
diff = torch.norm(p-expected_p).item()
print(f'diff={diff}')
test.assertLess(diff, 1e-3)
diff=1.0932822078757454e-06
Now that we can build a model and a loss function, compute their gradients, and update parameters with an optimizer, we can finally do some training!
In the spirit of more modular software design, we'll implement a class that will aid us in automating the repetitive training loop code that we usually write over and over again. This will be useful for both training our Layer-based models and also later for training PyTorch nn.Modules.
Here's our Trainer API:
import hw2.training as training
help(training.Trainer)
Help on class Trainer in module hw2.training:
class Trainer(abc.ABC)
| Trainer(model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
|
| A class abstracting the various tasks of training models.
|
| Provides methods at multiple levels of granularity:
| - Multiple epochs (fit)
| - Single epoch (train_epoch/test_epoch)
| - Single batch (train_batch/test_batch)
|
| Method resolution order:
| Trainer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, model: torch.nn.modules.module.Module, device: Union[torch.device, NoneType] = None)
| Initialize the trainer.
| :param model: Instance of the model to train.
| :param device: torch.device to run training on (CPU or GPU).
|
| fit(self, dl_train: torch.utils.data.dataloader.DataLoader, dl_test: torch.utils.data.dataloader.DataLoader, num_epochs: int, checkpoints: str = None, early_stopping: int = None, print_every: int = 1, **kw) -> cs236781.train_results.FitResult
| Trains the model for multiple epochs with a given training set,
| and calculates validation loss over a given validation set.
| :param dl_train: Dataloader for the training set.
| :param dl_test: Dataloader for the test set.
| :param num_epochs: Number of epochs to train for.
| :param checkpoints: Whether to save model to file every time the
| test set accuracy improves. Should be a string containing a
| filename without extension.
| :param early_stopping: Whether to stop training early if there is no
| test loss improvement for this number of epochs.
| :param print_every: Print progress every this number of epochs.
| :return: A FitResult object containing train and test losses per epoch.
|
| save_checkpoint(self, checkpoint_filename: str)
| Saves the model in its current state to a file with the given name (treated
| as a relative path).
| :param checkpoint_filename: File name or relative path to save to.
|
| test_batch(self, batch) -> cs236781.train_results.BatchResult
| Runs a single batch forward through the model and calculates loss.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset).
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| test_epoch(self, dl_test: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
| Evaluate model once over a test set (single epoch).
| :param dl_test: DataLoader for the test set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| train_batch(self, batch) -> cs236781.train_results.BatchResult
| Runs a single batch forward through the model, calculates loss,
| performs back-propagation and updates weights.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset).
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| train_epoch(self, dl_train: torch.utils.data.dataloader.DataLoader, **kw) -> cs236781.train_results.EpochResult
| Train once over a training set (single epoch).
| :param dl_train: DataLoader for the training set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'test_batch', 'train_batch'})
The Trainer class splits the task of training (and evaluating) models into three conceptual levels:
- Multiple epochs: the fit method, which returns a FitResult containing losses and accuracies for all epochs.
- Single epoch: the train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
- Single batch: the train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.

The Trainer class implements the first two levels. Inheriting classes are expected to implement the single-batch level methods, since these are model- and/or task-specific.
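The split between the generic loops and the task-specific per-batch logic can be illustrated with a minimal, self-contained sketch (the class and method names below are illustrative stand-ins, not the actual hw2 API):

```python
import abc
from typing import NamedTuple, List

class BatchResult(NamedTuple):
    loss: float
    num_correct: int

class TinyTrainer(abc.ABC):
    """Minimal sketch of the Trainer pattern: the base class owns the
    multi-epoch and per-epoch loops; subclasses implement only the
    per-batch logic, which is model/task specific."""

    def fit(self, dl_train, num_epochs) -> List[List[BatchResult]]:
        # Level 1: multiple epochs.
        return [self.train_epoch(dl_train) for _ in range(num_epochs)]

    def train_epoch(self, dl_train) -> List[BatchResult]:
        # Level 2: a single epoch is just a loop over batches.
        return [self.train_batch(batch) for batch in dl_train]

    @abc.abstractmethod
    def train_batch(self, batch) -> BatchResult:
        # Level 3: left abstract; inheriting classes must implement it.
        ...

class ConstantTrainer(TinyTrainer):
    # Toy subclass: "trains" by reporting a fixed loss per batch.
    def train_batch(self, batch) -> BatchResult:
        return BatchResult(loss=0.0, num_correct=len(batch))

epochs = ConstantTrainer().fit(dl_train=[[1, 2], [3]], num_epochs=2)
print(len(epochs), len(epochs[0]))  # 2 epochs, 2 batches per epoch
```

The real Trainer additionally aggregates batch results into epoch/fit results and handles printing, checkpoints and early stopping, but the inheritance structure is the same.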
The first thing we should do in order to verify our model, gradient calculations and optimizer implementation is to try to overfit a large model (many parameters) to a small dataset (few images). This will show us that things are working properly.
Let's begin by loading the CIFAR-10 dataset.
import os
import torchvision
import torchvision.transforms as tvtf

data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples
Test: 10000 samples
Now, let's implement just a small part of our training logic since that's what we need right now.
TODO:
- Implement the train_batch() method in the LayerTrainer class within the hw2/training.py module.
- Implement the part2_overfit_hp() function in the hw2/answers.py module. Tweak the hyperparameter values until your model overfits a small number of samples in the code block below. You should get 100% accuracy within a few epochs.

The following code block will use your custom Layer-based MLP implementation, custom vanilla SGD and custom trainer to overfit the data. The classification accuracy should be 100% within a few epochs.
import hw2.layers as layers
import hw2.answers as answers
import hw2.optimizers as optimizers
from torch.utils.data import DataLoader
# Overfit to a very small dataset of 20 samples
batch_size = 10
max_batches = 2
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Get hyperparameters
hp = answers.part2_overfit_hp()
torch.manual_seed(seed)
# Build a model and loss using our custom MLP and CE implementations
model = layers.MLP(3*32*32, num_classes=10, hidden_features=[128]*3, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
# Use our custom optimizer
optimizer = optimizers.VanillaSGD(model.params(), learn_rate=hp['lr'], reg=hp['reg'])
# Run training over small dataset multiple times
trainer = training.LayerTrainer(model, loss_fn, optimizer)
best_acc = 0
for i in range(20):
res = trainer.train_epoch(dl_train, max_batches=max_batches)
best_acc = max(best_acc, res.accuracy)
test.assertGreaterEqual(best_acc, 98)
Now that we know training works, let's try to fit a model to a bit more data for a few epochs, to see how well we're doing. First, we need a function to plot a FitResult object.
from cs236781.plot import plot_fit
plot_fit?
TODO:
- Implement the test_batch() method in the LayerTrainer class within the hw2/training.py module.
- Implement the fit() method of the Trainer class within the hw2/training.py module.
- Implement the part2_optim_hp() function in the hw2/answers.py module.

# Define a larger part of the CIFAR-10 dataset (still not the whole thing)
batch_size = 50
max_batches = 100
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size//2, shuffle=False)
# Define a function to train a model with our Trainer and various optimizers
def train_with_optimizer(opt_name, opt_class, fig):
torch.manual_seed(seed)
# Get hyperparameters
hp = answers.part2_optim_hp()
hidden_features = [128] * 5
num_epochs = 10
# Create model, loss and optimizer instances
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
optimizer = opt_class(model.params(), learn_rate=hp[f'lr_{opt_name}'], reg=hp['reg'])
# Train with the Trainer
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches)
fig, axes = plot_fit(fit_res, fig=fig, legend=opt_name)
return fig
fig_optim = None
fig_optim = train_with_optimizer('vanilla', optimizers.VanillaSGD, fig_optim)
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
The simple vanilla SGD update is rarely used in practice since it's very slow to converge relative to other optimization algorithms.
One reason is that naïvely updating in the direction of the current gradient causes it to fluctuate wildly in areas where the loss surface is much steeper in some dimensions than in others. Another reason is that using the same learning rate for all parameters is not a great idea, since not all parameters are created equal. For example, parameters associated with rare features should be updated with a larger step than ones associated with commonly-occurring features, because they'll get fewer updates through the gradients.
Therefore more advanced optimizers take into account the previous gradients of a parameter and/or try to use a per-parameter specific learning rate instead of a common one.
Let's now implement a simple and common optimizer: SGD with Momentum. This optimizer takes previous gradients of a parameter into account when updating its value, instead of just the current one. In practice it usually provides faster convergence than vanilla SGD.
The SGD with Momentum update rule can be stated as follows: $$\begin{align} \vec{v}_{t+1} &= \mu \vec{v}_t - \eta \delta \vec{\theta}_t \\ \vec{\theta}_{t+1} &= \vec{\theta}_t + \vec{v}_{t+1} \end{align}$$
Where $\eta$ is the learning rate, $\vec{\theta}$ is a model parameter, $\delta \vec{\theta}_t=\pderiv{L}{\vec{\theta}}(\vec{\theta}_t)$ is the gradient of the loss w.r.t. to the parameter and $0\leq\mu<1$ is a hyperparameter known as momentum.
Expanding the update rule recursively shows how the parameter update in fact depends on all previous gradient values for that parameter, where the old gradients are exponentially decayed by a factor of $\mu$ at each timestep.
Since we're incorporating previous gradients (update directions), a noisy value of the current gradient will have less effect, so that the general direction of previous updates is somewhat maintained. The following figure illustrates this.
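A minimal sketch of this update rule on a toy quadratic loss (an illustration only, not the required MomentumSGD implementation):

```python
import torch

def momentum_sgd_step(params, velocities, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step following the update rule above:
    v <- mu * v - eta * grad;  theta <- theta + v."""
    with torch.no_grad():
        for p, v in zip(params, velocities):
            v.mul_(momentum).add_(p.grad, alpha=-lr)  # v = mu*v - eta*grad
            p.add_(v)                                 # theta = theta + v

# Usage on a toy quadratic loss L(w) = ||w||^2, whose minimum is at 0.
w = torch.tensor([1.0, -2.0], requires_grad=True)
velocities = [torch.zeros_like(w)]
for _ in range(100):
    if w.grad is not None:
        w.grad.zero_()
    loss = (w ** 2).sum()
    loss.backward()
    momentum_sgd_step([w], velocities, lr=0.1, momentum=0.9)
print(w)  # close to the minimum at (0, 0)
```

Note how each step blends the new gradient into the decaying velocity, so the trajectory keeps moving in the accumulated direction even when a single gradient is noisy.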
TODO:
- Implement the MomentumSGD class in the hw2/optimizers.py module.
- Update the part2_optim_hp() function in the hw2/answers.py module.

fig_optim = train_with_optimizer('momentum', optimizers.MomentumSGD, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
RMSProp is another optimizer that accounts for previous gradients, but this time it uses them to adapt the learning rate per parameter.
RMSProp maintains a decaying moving average of previous squared gradients, $$ \vec{r}_{t+1} = \gamma\vec{r}_{t} + (1-\gamma)\delta\vec{\theta}_t^2 $$ where $0<\gamma<1$ is a decay constant usually set close to $1$, and $\delta\vec{\theta}_t^2$ denotes element-wise squaring.
The update rule for each parameter is then, $$ \vec{\theta}_{t+1} = \vec{\theta}_t - \left( \frac{\eta}{\sqrt{\vec{r}_{t+1}+\varepsilon}} \right) \delta\vec{\theta}_t $$
where $\varepsilon$ is a small constant to prevent numerical instability. The idea here is to decrease the learning rate for parameters with high gradient values and vice-versa. The decaying moving average prevents accumulating all the past gradients which would cause the effective learning rate to become zero.
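A minimal sketch of the RMSProp update on the same kind of toy quadratic loss (an illustration only, not the required RMSProp implementation):

```python
import torch

def rmsprop_step(params, sq_avgs, lr=0.05, gamma=0.9, eps=1e-8):
    """One RMSProp step following the update rule above:
    r <- gamma * r + (1 - gamma) * grad^2   (element-wise square);
    theta <- theta - (eta / sqrt(r + eps)) * grad."""
    with torch.no_grad():
        for p, r in zip(params, sq_avgs):
            r.mul_(gamma).addcmul_(p.grad, p.grad, value=1 - gamma)
            p.addcdiv_(p.grad, (r + eps).sqrt(), value=-lr)

# Usage on a toy quadratic loss L(w) = ||w||^2.
w = torch.tensor([1.0, -2.0], requires_grad=True)
sq_avgs = [torch.zeros_like(w)]
for _ in range(200):
    if w.grad is not None:
        w.grad.zero_()
    loss = (w ** 2).sum()
    loss.backward()
    rmsprop_step([w], sq_avgs, lr=0.05, gamma=0.9)
print(w)  # settles into a small neighborhood of the minimum (0, 0)
```

Dividing by the root of the moving average makes the effective step roughly scale-invariant: parameters with consistently large gradients get smaller steps, and vice-versa.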
Bonus:
- Implement the RMSProp class in the hw2/optimizers.py module.
- Update the part2_optim_hp() function in the hw2/answers.py module.

fig_optim = train_with_optimizer('rmsprop', optimizers.RMSProp, fig_optim)
fig_optim
--- EPOCH 1/10 ---
--- EPOCH 2/10 ---
--- EPOCH 3/10 ---
--- EPOCH 4/10 ---
--- EPOCH 5/10 ---
--- EPOCH 6/10 ---
--- EPOCH 7/10 ---
--- EPOCH 8/10 ---
--- EPOCH 9/10 ---
--- EPOCH 10/10 ---
Note that you should get better train/test accuracy with Momentum and RMSProp than Vanilla.
Dropout is a useful technique to improve generalization of deep models.
The idea is simple: during the forward pass, drop (i.e. set to zero) the activation of each neuron with probability $p$. For example, if $p=0.4$ this means we drop the activations of 40% of the neurons (on average).
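For illustration, here is a sketch of the common "inverted dropout" scheme, in which activations are rescaled at train time so that no scaling is needed at test time (whether the assignment requires exactly this variant is an assumption):

```python
import torch

def dropout_forward(x, p=0.4, training=True):
    """Inverted-dropout sketch: in train mode, zero each activation with
    probability p and scale the survivors by 1/(1-p), so the expected value
    of each activation is unchanged; in test mode, pass x through as-is."""
    if not training or p == 0:
        return x
    mask = (torch.rand_like(x) >= p).to(x.dtype) / (1 - p)
    return x * mask

torch.manual_seed(0)
x = torch.ones(1000)
y = dropout_forward(x, p=0.4, training=True)
print(f'dropped fraction: {(y == 0).float().mean():.2f}')  # ~0.40
print(f'mean activation:  {y.mean():.2f}')                 # ~1.00
```

The 1/(1-p) rescaling is what keeps train-time and test-time activations on the same scale, which is why the gradient comparison below should pass in both modes.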
There are a few important things to note about dropout:
TODO:
- Implement the Dropout class in the hw2/layers.py module.
- Update the MLP's __init__() method in the hw2/layers.py module. If dropout>0 you should add a Dropout layer after each ReLU.

from hw2.grad_compare import compare_layer_to_torch
# Check architecture of MLP with dropout layers
mlp_dropout = layers.MLP(in_features, num_classes, [50]*3, dropout=0.6)
print(mlp_dropout)
test.assertEqual(len(mlp_dropout.sequence), 10)
for b1, b2 in zip(mlp_dropout.sequence, mlp_dropout.sequence[1:]):
if str(b1).lower() == 'relu':
test.assertTrue(str(b2).startswith('Dropout'))
test.assertTrue(str(mlp_dropout.sequence[-1]).startswith('Linear'))
MLP, Sequential
[0] Linear(self.in_features=3072, self.out_features=50)
[1] ReLU
[2] Dropout(p=0.6)
[3] Linear(self.in_features=50, self.out_features=50)
[4] ReLU
[5] Dropout(p=0.6)
[6] Linear(self.in_features=50, self.out_features=50)
[7] ReLU
[8] Dropout(p=0.6)
[9] Linear(self.in_features=50, self.out_features=10)
# Test end-to-end gradient in train and test modes.
print('Dropout, train mode')
mlp_dropout.train(True)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
print('Dropout, test mode')
mlp_dropout.train(False)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
Dropout, train mode
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
Dropout, test mode
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
To see whether dropout really improves generalization, let's take a small training set (small enough to overfit) and a large test set and check whether we get less overfitting and perhaps improved test-set accuracy when using dropout.
# Define a small set from CIFAR-10, but take a larger test set since we want to test generalization
batch_size = 10
max_batches = 40
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size*2, shuffle=False)
TODO:
Tweak the hyperparameters for this section in the part2_dropout_hp() function in the hw2/answers.py module. Try to set them so that the first model (with dropout=0) overfits. You can disable the other dropout options until you tune the hyperparameters. We can then see the effect of dropout for generalization.
# Get hyperparameters
hp = answers.part2_dropout_hp()
hidden_features = [400] * 1
num_epochs = 30
torch.manual_seed(seed)
fig=None
#for dropout in [0]: # Use this for tuning the hyperparams until you overfit
for dropout in [0, 0.4, 0.8]:
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'], dropout=dropout)
loss_fn = layers.CrossEntropyLoss()
optimizer = optimizers.MomentumSGD(model.params(), learn_rate=hp['lr'], reg=0)
print('*** Training with dropout=', dropout)
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res_dropout = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches, print_every=6)
fig, axes = plot_fit(fit_res_dropout, fig=fig, legend=f'dropout={dropout}', log_loss=True)
*** Training with dropout= 0
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.4
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
*** Training with dropout= 0.8
--- EPOCH 1/30 ---
--- EPOCH 7/30 ---
--- EPOCH 13/30 ---
--- EPOCH 19/30 ---
--- EPOCH 25/30 ---
--- EPOCH 30/30 ---
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Regarding the graphs you got for the three dropout configurations:
1. Explain the graphs of no-dropout vs. dropout. Do they match what you expected to see?
2. Compare the low-dropout setting to the high-dropout setting and explain based on your graphs.
display_answer(hw2.answers.part2_q1)
Your answer:
1.1.
The graphs comparing the no-dropout and dropout configurations align with our expectations. Without dropout, the training accuracy remains more stable across epochs, showing fewer fluctuations compared to the dropout configurations. This is because all neurons are active during training without dropout, leading to stronger and more focused connections within the network. As a result, the model tends to perform well on the training data, resulting in higher training accuracy.

On the other hand, the introduction of dropout introduces more spikes in the accuracy curves between epochs. Dropout randomly "drops out" a fraction of neurons during each training iteration, encouraging the network to distribute the representation across multiple sets of neurons. This promotes better generalization and reduces overfitting. As a consequence, the model may exhibit lower training accuracy compared to the no-dropout case.

In terms of test accuracy, the results show a trade-off between the dropout configurations and the no-dropout case. Without dropout, the model achieves higher test accuracy since it has learned strong and specific connections tailored to the training data. However, this can lead to overfitting, where the model fails to generalize well to unseen data.
1.2.
As the dropout rate increases, we observe a decrease in test accuracy but an improvement in generalization. For example, comparing a low dropout rate (e.g., 0.4) to no dropout, the test accuracy may slightly decrease, but it helps in reducing overfitting and improving performance on unseen data. Dropout prevents the model from relying too heavily on specific neurons, encouraging the network to learn more robust and generalizable features. However, it's important to note that extremely high dropout rates, such as 0.8, can harm both training and testing accuracy. With a dropout rate of 0.8, a significant portion of neurons is disabled during training, severely limiting the model's capacity to learn meaningful patterns from the data. As a result, the model's performance is likely to suffer, yielding poor results for both training and testing.
When training a model with the cross-entropy loss function, is it possible for the test loss to increase for a few epochs while the test accuracy also increases?
If it's possible explain how, if it's not explain why not.
display_answer(hw2.answers.part2_q2)
Your answer:
Yes, it is possible. It might occur in some scenarios due to the behavior of the cross-entropy loss function.
The cross-entropy loss penalizes predicted labels $\hat{y}$ according to how far (in terms of distance) they are from the true labels $y$, whereas the accuracy only cares whether the predicted labels equal the true labels.
For example, consider a confidently correct prediction $\hat{y}_1 = y_1$ and two predictions $\hat{y}_2 \ne y_2$, $\hat{y}_3 \ne y_3$ that are very close to the decision boundary. Suppose that in some epoch $\hat{y}_2$ and $\hat{y}_3$ shift slightly so that $\hat{y}_2 = y_2$ and $\hat{y}_3 = y_3$, while the score of $\hat{y}_1$ drops sharply so that $\hat{y}_1 \ne y_1$.
In that case, two predictions became correct and only one became incorrect, so the test accuracy increases.
On the other hand, the total distance between the predicted and the true labels has grown (the confident mistake on $\hat{y}_1$ dominates), and thus the loss also increases.
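A tiny numeric illustration of this scenario, using hypothetical logits for three samples and PyTorch's cross-entropy:

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1, 1, 1])  # all three samples belong to class 1

# Earlier epoch: one very confident correct prediction, two barely wrong.
logits_a = torch.tensor([[0.0, 4.0], [0.1, 0.0], [0.1, 0.0]])
# Later epoch: two barely correct predictions, one confidently wrong.
logits_b = torch.tensor([[6.0, 0.0], [0.0, 0.1], [0.0, 0.1]])

acc_a = (logits_a.argmax(dim=1) == y).float().mean()
acc_b = (logits_b.argmax(dim=1) == y).float().mean()
loss_a = F.cross_entropy(logits_a, y)
loss_b = F.cross_entropy(logits_b, y)
print(f'accuracy: {acc_a:.2f} -> {acc_b:.2f}')  # 0.33 -> 0.67 (up)
print(f'loss:     {loss_a:.3f} -> {loss_b:.3f}')  # 0.502 -> 2.430 (up)
```

The single confident mistake dominates the mean loss, so accuracy and loss move up together.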
1. Explain the difference between gradient descent and back-propagation.
2. Compare in detail between gradient descent (GD) and stochastic gradient descent (SGD).
3. Why is SGD used more often in the practice of deep learning? Provide a few justifications.
4. You would like to try GD to train your model instead of SGD, but you're concerned that your dataset won't fit in memory. A friend suggested that you should split the data into disjoint batches, do multiple forward passes until all data is exhausted, and then do one backward pass on the sum of the losses.
display_answer(hw2.answers.part2_q3)
Your answer:
3.1:
Gradient descent (GD) is an optimization algorithm used to minimize the loss function by iteratively updating the model's parameters.
Backpropagation is an algorithm for efficiently calculating (using the chain rule) the derivatives of the loss function w.r.t. the parameters.
3.2:
GD and SGD are both optimization algorithms used in training machine learning models. They differ in the way they update the model's parameters:
In GD, at each step the algorithm considers the whole dataset $X$ to decide and perform the update. In contrast, in SGD, at each step the algorithm samples a subset of the dataset $X$ of size $BatchSize$, and takes only that subset into consideration while deciding the step.
GD is more robust and less sensitive to noisy data compared to SGD, because it uses the entire dataset. Furthermore, SGD might not settle exactly at the minimum (the solution may fluctuate around the optimal point), because it is affected by individual samples. Nevertheless, each SGD step is much cheaper to compute, and thus SGD typically converges faster than GD in practice.
3.3:
Following are some reasons for that:
- SGD uses only a part of the dataset at each step, making optimization feasible and efficient for big datasets, for which running full-batch GD is sometimes not even possible.
- SGD uses different samples in each step, which sometimes allows it to converge better, since noisy samples that might lead to bad steps affect only a fraction of the updates.
- Each SGD step is much cheaper to compute, so SGD typically converges faster in terms of wall-clock time, as explained above.
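The difference between the two update computations can be sketched on a toy least-squares problem (all names and data here are illustrative):

```python
import torch

def gd_step(w, X, y, lr):
    """Full-batch GD: one step uses the gradient over the entire dataset."""
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    return w - lr * grad

def sgd_step(w, X, y, lr, batch_size):
    """SGD: one step uses the gradient over a random minibatch only."""
    idx = torch.randperm(len(y))[:batch_size]
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    return w - lr * grad

torch.manual_seed(0)
X = torch.randn(1000, 2)
w_true = torch.tensor([3.0, -1.0])
y = X @ w_true + 0.1 * torch.randn(1000)

w_gd, w_sgd = torch.zeros(2), torch.zeros(2)
for _ in range(200):
    w_gd = gd_step(w_gd, X, y, lr=0.1)
    w_sgd = sgd_step(w_sgd, X, y, lr=0.1, batch_size=10)
print(w_gd, w_sgd)  # both approach w_true; SGD fluctuates around it
```

Each SGD step touches 10 samples instead of 1000, which is the computational saving the answer above refers to; the price is the residual fluctuation around the optimum.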
Let $f = f_n \circ f_{n-1} \circ ... \circ f_1$ where each $f_i: \mathbb{R} \rightarrow \mathbb{R}$ is a differentiable function which is easy to evaluate and differentiate (each query costs $\mathcal{O}(1)$ at a given point).
Assume that you are given $f$ already expressed as a computational graph, and a point $x_0$.
1. Show how to reduce the memory complexity of computing the gradient using forward mode AD (maintaining the $\mathcal{O}(n)$ computation cost). What is the memory complexity?
2. Show how to reduce the memory complexity of computing the gradient using backward mode AD (maintaining the $\mathcal{O}(n)$ computation cost). What is the memory complexity?
3. Can these techniques be generalized to arbitrary computational graphs?
4. Think how the backprop algorithm can benefit from these techniques when applied to deep architectures (e.g. VGGs, ResNets).
display_answer(hw2.answers.part2_q4)
Your answer:
4.1.
In forward mode AD over a chain of scalar functions, we traverse the computational graph from input to output while maintaining only two values: the current intermediate result and the derivative accumulated so far (via the chain rule). At each of the $n$ steps we evaluate the next function and its derivative at the current value, then update both. Since each step overwrites the previous value and derivative, nothing else needs to be stored: the memory complexity is $\mathcal{O}(1)$, while the computation cost remains $\mathcal{O}(n)$.
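A minimal sketch of this constant-memory forward-mode pass over a chain of scalar functions (the example functions are illustrative):

```python
import math

def forward_mode_chain(fs, dfs, x0):
    """Forward-mode AD for f = f_n ∘ ... ∘ f_1 on scalars: carry only the
    current value and the running derivative, so memory stays O(1) while
    computation stays O(n)."""
    val, grad = x0, 1.0
    for f, df in zip(fs, dfs):
        grad = df(val) * grad  # chain rule, using the pre-activation value
        val = f(val)           # then advance the value through f_i
    return val, grad

# Example: f(x) = sin(exp(x^2)), f'(x) = cos(exp(x^2)) * exp(x^2) * 2x.
fs  = [lambda x: x ** 2, math.exp, math.sin]
dfs = [lambda x: 2 * x, math.exp, math.cos]
val, grad = forward_mode_chain(fs, dfs, 0.5)
print(val, grad)
```

Only `val` and `grad` survive between iterations, which is exactly the O(1)-memory property claimed above.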
4.2.
Similarly, for backward (reverse) mode AD we can use checkpointing to reduce memory while keeping the computation cost $\mathcal{O}(n)$. Instead of storing all $n$ intermediate results for the backward pass, we store only about $\sqrt{n}$ evenly-spaced checkpoints during the forward pass. During the backward pass, whenever we need the intermediate results of a segment, we recompute them from the nearest checkpoint. Each segment is recomputed at most once, so the total computation remains $\mathcal{O}(n)$, while the memory complexity is reduced from $\mathcal{O}(n)$ to $\mathcal{O}(\sqrt{n})$.
4.3.
These techniques leverage the concept of checkpointing to minimize memory usage during gradient computation, and they can be generalized to arbitrary computational graphs: we break the graph into subgraphs, store only the values at subgraph boundaries, and recompute the interior values of each subgraph during the backward pass. Note, however, that the sequential-chain analysis above does not transfer directly; for graphs with wide parallel branches, more values may need to be held simultaneously, so the achievable memory savings depend on the graph's structure.
4.4.
In the context of deep architectures such as VGGs and ResNets, these memory optimization techniques offer significant benefits. Such architectures have many layers, so storing every intermediate activation for backpropagation requires a lot of memory. Gradient checkpointing alleviates this burden by storing only a subset of activations and recomputing the rest during the backward pass. This enables training on hardware with limited memory, and allows larger batch sizes or deeper models, at the cost of a modest amount of recomputation.
In this part we'll implement a general purpose MLP and Binary Classifier using pytorch.
We'll implement its training, and also learn about decision boundaries and threshold selection in the context of binary classification. Finally, we'll explore the effect of depth and width on an MLP's performance.
import os
import re
import sys
import glob
import unittest
from typing import Sequence, Tuple
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as tvtf
from torch import Tensor
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
To test our first neural network-based classifiers we'll start by creating a toy binary classification dataset, but one which is not trivial for a linear model.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
def rotate_2d(X, deg=0):
"""
Rotates each 2d sample in X of shape (N, 2) by deg degrees.
"""
a = np.deg2rad(deg)
return X @ np.array([[np.cos(a), -np.sin(a)],[np.sin(a), np.cos(a)]]).T
def plot_dataset_2d(X, y, n_classes=2, alpha=0.2, figsize=(8, 6), title=None, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=figsize)
for c in range(n_classes):
ax.scatter(*X[y==c,:].T, alpha=alpha, label=f"class {c}");
ax.set_xlabel("$x_1$"); ax.set_ylabel("$x_2$");
ax.legend(); ax.set_title((title or '') + f" (n={len(y)})")
We'll split our data into 80% train and validation, and 20% test. To make it a bit more challenging, we'll simulate a somewhat real-world setting where there are multiple populations, and the training/validation data is not sampled iid from the underlying data distribution.
np.random.seed(seed)
N = 10_000
N_train = int(N * .8)
# Create data from two different distributions for the training/validation
X1, y1 = make_moons(n_samples=N_train//2, noise=0.2)
X1 = rotate_2d(X1, deg=10)
X2, y2 = make_moons(n_samples=N_train//2, noise=0.25)
X2 = rotate_2d(X2, deg=50)
# Test data comes from a similar but noisier distribution
X3, y3 = make_moons(n_samples=(N-N_train), noise=0.3)
X3 = rotate_2d(X3, deg=40)
X, y = np.vstack([X1, X2, X3]), np.hstack([y1, y2, y3])
# Train and validation data is from mixture distribution
X_train, X_valid, y_train, y_valid = train_test_split(X[:N_train, :], y[:N_train], test_size=1/3, shuffle=False)
# Test data is only from the second distribution
X_test, y_test = X[N_train:, :], y[N_train:]
fig, ax = plt.subplots(1, 3, figsize=(20, 5))
plot_dataset_2d(X_train, y_train, title='Train', ax=ax[0]);
plot_dataset_2d(X_valid, y_valid, title='Validation', ax=ax[1]);
plot_dataset_2d(X_test, y_test, title='Test', ax=ax[2]);
Now let us create a data loader for each dataset.
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
batch_size = 32
dl_train, dl_valid, dl_test = [
DataLoader(
dataset=TensorDataset(
torch.from_numpy(X_).to(torch.float32),
torch.from_numpy(y_)
),
shuffle=True,
num_workers=0,
batch_size=batch_size
)
for X_, y_ in [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]
]
print(f'{len(dl_train.dataset)=}, {len(dl_valid.dataset)=}, {len(dl_test.dataset)=}')
len(dl_train.dataset)=5333, len(dl_valid.dataset)=2667, len(dl_test.dataset)=2000
A multilayer perceptron is arguably the most basic type of neural network model. It is composed of $L$ layers, each layer $l$ with $n_l$ perceptron ("neuron") units. Each perceptron is connected to all outputs of the previous layer (or to all inputs, in the first layer), calculates their weighted sum, applies a non-linearity and produces a single output.

Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates:
$$ \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~ \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l},~ l \in \{1,2,\dots,L\}. $$

To begin, let's implement a general multi-layer perceptron model. We'll seek to implement it in a way which is both general in terms of architecture, and also composable, so that we can use our MLP in the context of larger models.
TODO: Implement the MLP class in the hw2/mlp.py module.
from hw2.mlp import MLP
mlp = MLP(
in_dim=2,
dims=[8, 16, 32, 64],
nonlins=['relu', 'tanh', nn.LeakyReLU(0.314), 'softmax']
)
mlp
MLP(
(layers): Sequential(
(0): Linear(in_features=2, out_features=8, bias=True)
(1): ReLU()
(2): Linear(in_features=8, out_features=16, bias=True)
(3): Tanh()
(4): Linear(in_features=16, out_features=32, bias=True)
(5): LeakyReLU(negative_slope=0.314)
(6): Linear(in_features=32, out_features=64, bias=True)
(7): Softmax(dim=1)
)
)
Let's try our implementation on a batch of data.
x0, y0 = next(iter(dl_train))
yhat0 = mlp(x0)
test.assertEqual(len([*mlp.parameters()]), 8)
test.assertEqual(yhat0.shape, (batch_size, mlp.out_dim))
test.assertTrue(torch.allclose(torch.sum(yhat0, dim=1), torch.tensor(1.0)))
test.assertIsNotNone(yhat0.grad_fn)
yhat0
tensor([[0.0139, 0.0156, 0.0162, ..., 0.0157, 0.0146, 0.0165],
[0.0142, 0.0170, 0.0172, ..., 0.0147, 0.0154, 0.0170],
[0.0145, 0.0166, 0.0174, ..., 0.0144, 0.0156, 0.0168],
...,
[0.0144, 0.0172, 0.0176, ..., 0.0144, 0.0157, 0.0170],
[0.0147, 0.0163, 0.0175, ..., 0.0143, 0.0155, 0.0168],
[0.0138, 0.0170, 0.0167, ..., 0.0153, 0.0148, 0.0167]],
grad_fn=<SoftmaxBackward0>)
The MLP model we've implemented, while useful, is very general. For the task of binary classification, we would like to add some additional functionality to it: the ability to output a normalized score for a sample being in class one (which we interpret as a probability) and a prediction based on some threshold of this probability. In addition, we need some way to calculate a meaningful threshold based on the data and a trained model at hand.
In order to maintain generality, we'll add this functionality in the form of a wrapper: a BinaryClassifier class that can wrap any model producing two output features, and provide the functionality stated above.
TODO: In the hw2/classifier.py module, implement the BinaryClassifier and the missing parts of its base class, Classifier. Read the method documentation carefully and implement accordingly.
You can ignore the roc_threshold method at this stage.
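As a rough illustration of what such a wrapper does, here is a minimal sketch (BinaryWrapper is a hypothetical stand-in, not the required hw2/classifier.py implementation; the softmax-then-threshold scheme is an assumption based on the description above):

```python
import torch
import torch.nn as nn

class BinaryWrapper(nn.Module):
    """Sketch: wraps any two-output model, exposing class probabilities via
    softmax and hard 0/1 predictions via a threshold on the probability of
    class one."""
    def __init__(self, model: nn.Module, threshold: float = 0.5):
        super().__init__()
        self.model = model
        self.threshold = threshold

    def forward(self, x):
        return self.model(x)                        # raw (unnormalized) scores

    def predict_proba(self, x):
        return torch.softmax(self.model(x), dim=1)  # rows sum to 1

    def classify(self, x):
        proba_pos = self.predict_proba(x)[:, 1]     # P(class 1) per sample
        return (proba_pos > self.threshold).to(torch.int)

clf = BinaryWrapper(nn.Linear(2, 2), threshold=0.5)
x = torch.randn(8, 2)
print(clf.classify(x))  # tensor of 0/1 predictions, shape (8,)
```

Separating raw scores, probabilities and thresholded predictions is what later lets us tune the threshold (e.g. from a ROC curve) without retraining the wrapped model.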
from hw2.classifier import BinaryClassifier
bmlp4 = BinaryClassifier(
model=MLP(in_dim=2, dims=[*[10]*3, 2], nonlins=[*['relu']*3, 'none']),
threshold=0.5
)
print(bmlp4)
# Test model
test.assertEqual(len([*bmlp4.parameters()]), 8)
test.assertIsNotNone(bmlp4(x0).grad_fn)
# Test forward
yhat0_scores = bmlp4(x0)
test.assertEqual(yhat0_scores.shape, (batch_size, 2))
test.assertFalse(torch.allclose(torch.sum(yhat0_scores, dim=1), torch.tensor(1.0)))
# Test predict_proba
yhat0_proba = bmlp4.predict_proba(x0)
test.assertEqual(yhat0_proba.shape, (batch_size, 2))
test.assertTrue(torch.allclose(torch.sum(yhat0_proba, dim=1), torch.tensor(1.0)))
# Test classify
yhat0 = bmlp4.classify(x0)
test.assertEqual(yhat0.shape, (batch_size,))
test.assertEqual(yhat0.dtype, torch.int)
test.assertTrue(all(yh_ in (0, 1) for yh_ in yhat0))
BinaryClassifier(
(model): MLP(
(layers): Sequential(
(0): Linear(in_features=2, out_features=10, bias=True)
(1): ReLU()
(2): Linear(in_features=10, out_features=10, bias=True)
(3): ReLU()
(4): Linear(in_features=10, out_features=10, bias=True)
(5): ReLU()
(6): Linear(in_features=10, out_features=2, bias=True)
(7): Identity()
)
)
)
Now that we have a classifier, we need to train it.
We will abstract the various aspects of training, such as multiple epochs, iterating over batches, early stopping and saving model checkpoints, into a Trainer class that will take care of these concerns.
The Trainer class splits the task of training (and evaluating) models into three conceptual levels:
1. The fit method, which returns a FitResult containing losses and accuracies for all epochs.
2. The train_epoch and test_epoch methods, which return an EpochResult containing the losses per batch and the single accuracy result of the epoch.
3. The train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.
The Trainer implements the first two levels. Inheriting classes are expected to implement the single-batch methods, since these are model- and/or task-specific.
TODO:
Implement the Trainer's fit method and the ClassifierTrainer's train_batch/test_batch methods, in the hw2/training.py module. You may ignore the optional parts about early stopping and model checkpoints at this stage.
Set the model's architecture hyper-parameters and the optimizer hyperparameters in part3_arch_hp() and part3_optim_hp(), respectively, in hw2/answers.py.
Since this is a toy dataset, you should be able to quickly get above 85% accuracy even on the test set.
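A rough sketch of the three-level structure described above (method names are assumed from the description; the FitResult/EpochResult/BatchResult named tuples are replaced by plain values here, and none of the optional early-stopping or checkpointing logic is shown):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TrainerSketch:
    """Illustrative three-level trainer (not the hw2/training.py version)."""
    def __init__(self, model, loss_fn, optimizer):
        self.model, self.loss_fn, self.optimizer = model, loss_fn, optimizer

    # Level 1: run multiple epochs, collect per-epoch metrics.
    def fit(self, dl_train, dl_test, num_epochs):
        hist = dict(train_loss=[], train_acc=[], test_loss=[], test_acc=[])
        for _ in range(num_epochs):
            tl, ta = self.train_epoch(dl_train)
            vl, va = self.test_epoch(dl_test)
            hist['train_loss'].append(tl); hist['train_acc'].append(ta)
            hist['test_loss'].append(vl); hist['test_acc'].append(va)
        return hist

    # Level 2: one full pass over a dataloader.
    def train_epoch(self, dl):
        self.model.train()
        return self._run_epoch(dl, self.train_batch)

    def test_epoch(self, dl):
        self.model.eval()
        with torch.no_grad():
            return self._run_epoch(dl, self.test_batch)

    def _run_epoch(self, dl, batch_fn):
        losses, correct, total = [], 0, 0
        for X, y in dl:
            loss, n_correct = batch_fn(X, y)
            losses.append(loss); correct += n_correct; total += len(y)
        return sum(losses) / len(losses), 100.0 * correct / total

    # Level 3: a single batch (model/task specific).
    def train_batch(self, X, y):
        self.optimizer.zero_grad()
        scores = self.model(X)
        loss = self.loss_fn(scores, y)
        loss.backward()
        self.optimizer.step()
        return loss.item(), (scores.argmax(dim=1) == y).sum().item()

    def test_batch(self, X, y):
        scores = self.model(X)
        return self.loss_fn(scores, y).item(), (scores.argmax(dim=1) == y).sum().item()

# Usage on a tiny linearly separable toy problem
torch.manual_seed(0)
X = torch.randn(200, 2); y = (X[:, 0] + X[:, 1] > 0).long()
dl = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
model = nn.Linear(2, 2)
trainer = TrainerSketch(model, nn.CrossEntropyLoss(),
                        torch.optim.SGD(model.parameters(), lr=0.5))
hist = trainer.fit(dl, dl, num_epochs=20)
```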
from hw2.training import ClassifierTrainer
from hw2.answers import part3_arch_hp, part3_optim_hp
torch.manual_seed(seed)
hp_arch = part3_arch_hp()
hp_optim = part3_optim_hp()
model = BinaryClassifier(
model=MLP(
in_dim=2,
dims=[*[hp_arch['hidden_dims'],]*hp_arch['n_layers'], 2],
nonlins=[*[hp_arch['activation'],]*hp_arch['n_layers'], hp_arch['out_activation']]
),
threshold=0.5,
)
print(model)
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
trainer = ClassifierTrainer(model, loss_fn, optimizer)
fit_result = trainer.fit(dl_train, dl_valid, num_epochs=20, print_every=10);
test.assertGreaterEqual(fit_result.train_acc[-1], 85.0)
test.assertGreaterEqual(fit_result.test_acc[-1], 75.0)
BinaryClassifier(
(model): MLP(
(layers): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=512, bias=True)
(5): ReLU()
(6): Linear(in_features=512, out_features=512, bias=True)
(7): ReLU()
(8): Linear(in_features=512, out_features=2, bias=True)
(9): ReLU()
)
)
)
(training progress output elided: tqdm progress bars for epochs 1-20, each iterating over 167 train batches and 84 test batches)
from cs236781.plot import plot_fit
plot_fit(fit_result, log_loss=False, train_test_overlay=True);
An important part of understanding what a non-linear classifier like our MLP is doing is visualizing its decision boundaries. When we only have two input features, these are relatively simple to visualize: we can simply plot our data on the plane and evaluate our classifier on a regular 2D grid in order to approximate the decision boundary.
TODO: Implement the plot_decision_boundary_2d function in the hw2/classifier.py module.
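One plausible way to implement such a plot (a sketch only, assuming the classifier exposes a classify method as ours does; the grid resolution and margins are arbitrary choices):

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def plot_decision_boundary_sketch(classifier, x, y, ax=None, grid_res=200):
    """Evaluate the classifier on a dense 2D grid and plot filled contours
    of the predicted class, overlaid with the data points."""
    if ax is None:
        _, ax = plt.subplots()
    x = x.detach().numpy() if torch.is_tensor(x) else np.asarray(x)
    y = y.detach().numpy() if torch.is_tensor(y) else np.asarray(y)
    # Grid covering the data with a small margin on each side.
    x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, grid_res),
                         np.linspace(y_min, y_max, grid_res))
    grid = torch.from_numpy(np.stack([xx.ravel(), yy.ravel()], axis=1)).float()
    with torch.no_grad():
        zz = classifier.classify(grid).numpy().reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)       # decision regions
    ax.scatter(x[:, 0], x[:, 1], c=y, s=10)  # the actual samples
    return ax

# Usage with a hypothetical hand-coded classifier
class _Dummy:
    def classify(self, t):
        return (t[:, 0] + t[:, 1] > 0).to(torch.int)

pts = torch.randn(100, 2)
labels = (pts[:, 0] + pts[:, 1] > 0).to(torch.int)
ax = plot_decision_boundary_sketch(_Dummy(), pts, labels)
```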
from hw2.classifier import plot_decision_boundary_2d
fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
Another important component, especially in the context of binary classification, is threshold selection. Until now, we arbitrarily chose a threshold of 0.5 when deciding the class label based on the probability score we calculated via softmax. In other words, we classified a sample as class 1 (the 'positive' class) when its probability score was greater than or equal to 0.5.
However, in real-world classification problems we'll need to choose our threshold wisely, based on the domain-specific requirements of the problem. For example, depending on our application, we might care more about high sensitivity (correctly classifying positive examples), while for other applications specificity (correctly classifying negative examples) is more important.
One way to understand the mistakes a model is making is to look at its Confusion Matrix. From it, we easily see e.g. the false-negative rate (FNR) and false-positive rate (FPR).
Let's look at the confusion matrices on the test and validation data using the model we trained above.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
def plot_confusion(classifier, x: np.ndarray, y: np.ndarray, ax=None):
y_hat = classifier.classify(torch.from_numpy(x).to(torch.float32)).numpy()
conf_mat = confusion_matrix(y, y_hat, normalize='all')
ConfusionMatrixDisplay(conf_mat).plot(ax=ax, colorbar=False)
model.threshold = 0.5
_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
We can see that the model makes a different number of false-positive and false-negative errors. Clearly, this proportion would change if the classification threshold were different.
A very common way to select the classification threshold is to find a threshold which optimally balances between the FPR and FNR.
This can be done by plotting the model's ROC curve, which shows the TPR (1-FNR) vs. the FPR for multiple threshold values, and selecting the threshold whose point is closest to the ideal point (0, 1).
TODO: Implement the select_roc_thresh function in the hw2.classifier module.
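A possible sketch of the selection logic, using sklearn's roc_curve and assuming the classifier exposes predict_proba (as ours does):

```python
import numpy as np
import torch
from sklearn.metrics import roc_curve

def select_roc_thresh_sketch(classifier, x, y):
    """Compute the ROC curve from positive-class probabilities and pick the
    threshold whose (FPR, TPR) point is closest to the ideal corner (0, 1)."""
    with torch.no_grad():
        proba_pos = classifier.predict_proba(x)[:, 1].numpy()
    fpr, tpr, thresholds = roc_curve(y.numpy(), proba_pos)
    # Euclidean distance of each operating point to (0, 1).
    dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
    return float(thresholds[np.argmin(dist)])

# Usage with a hypothetical classifier whose score is a sigmoid of the input
class _Toy:
    def predict_proba(self, x):
        p = torch.sigmoid(x[:, :1])
        return torch.cat([1 - p, p], dim=1)

torch.manual_seed(0)
x = torch.randn(500, 1)
y = (x[:, 0] + 0.3 * torch.randn(500) > 0).long()
t = select_roc_thresh_sketch(_Toy(), x, y)
```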
from hw2.classifier import select_roc_thresh
optimal_thresh = select_roc_thresh(model, *dl_valid.dataset.tensors, plot=True)
Let's see the effect of our threshold selection on the confusion matrix and decision boundary.
model.threshold = optimal_thresh
_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
Now, equipped with the tools we've implemented so far, we'll experiment with various MLP architectures. We'll study the effect of the model's depth (number of hidden layers) and width (number of neurons per hidden layer) on its decision boundaries and the resulting performance. After training, we will use the validation set for threshold selection, and seek to maximize the performance on the test set.
TODO: Implement the mlp_experiment function in hw2/experiments.py.
You are free to configure any model and optimization hyperparameters however you like, except for the specified width and depth.
Experiment with various options for these other hyperparameters and try to obtain the best results you can.
from itertools import product
from tqdm.auto import tqdm
from hw2.experiments import mlp_experiment
torch.manual_seed(seed)
depths = [1, 2, 4]
widths = [2, 8, 32]
exp_configs = product(enumerate(widths), enumerate(depths))
fig, axes = plt.subplots(len(widths), len(depths), figsize=(10*len(depths), 10*len(widths)), squeeze=False)
test_accs = []
for (i, width), (j, depth) in tqdm(list(exp_configs)):
model, thresh, valid_acc, test_acc = mlp_experiment(
depth, width, dl_train, dl_valid, dl_test, n_epochs=10
)
test_accs.append(test_acc)
fig, ax = plot_decision_boundary_2d(model, *dl_test.dataset.tensors, ax=axes[i, j])
ax.set_title(f"{depth=}, {width=}")
ax.text(ax.get_xlim()[0]*.95, ax.get_ylim()[1]*.95, f"{thresh=:.2f}\n{valid_acc=:.1f}%\n{test_acc=:.1f}%", va="top")
# Assert minimal performance requirements.
# You should be able to do better than these by at least 5%.
test.assertGreaterEqual(np.min(test_accs), 75.0)
test.assertGreaterEqual(np.quantile(test_accs, 0.75), 85.0)
(training progress output elided: tqdm progress bars for the 9 width/depth configurations, each trained for 10 epochs over 167 train batches and 84 test batches)
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Consider the first binary classifier you trained in this notebook and the loss/accuracy curves we plotted for it on the train and validation sets, as well as the decision boundary plot.
Based on those plots, explain qualitatively whether or not your model has:
Explain your answers for each of the above. Since this is a qualitative question, assume "high" simply means "I would take measures in order to decrease it further".
display_answer(hw2.answers.part3_q1)
Your answer:
Our model has low optimization error. This conclusion is drawn from observing that the loss graph decreases and the accuracy graph increases for the validation set as the model learns. The decreasing loss indicates that the model is effectively optimizing its parameters to minimize the discrepancy between predicted and actual values.
Our model has low generalization error. This conclusion is based on several factors. Firstly, the validation graph shows similar trends to the train graph, suggesting that the model performs well on unseen data. Additionally, the validation accuracy of approximately 90% indicates that the model can successfully classify unseen samples, demonstrating good generalization ability.
Our model has low approximation error. Although there are some indications of approximation error in the decision plots, it is not considered high. The decision boundary is close to optimal, and the validation accuracy of around 90% further supports this assessment. If the model had underfitted, we would expect a significantly lower validation accuracy.
Consider the first binary classifier you trained in this notebook and the confusion matrices we plotted for it.
For the validation dataset, would you expect the FPR or the FNR to be higher, and why? Recall that you have full knowledge of the data generating process.
display_answer(hw2.answers.part3_q2)
Your answer:
We would expect the False Negative Rate (FNR) to be higher than the False Positive Rate (FPR) for the validation dataset. The reasoning behind this expectation is that the training dataset's plot reveals a higher concentration of positive samples surrounded by negative samples. As a result, the learned decision boundary is likely to classify more positive samples as negative. This tendency carries over to the validation dataset, resulting in a higher FNR.
You're training a binary classifier to screen a large cohort of patients for some disease, with the aim of detecting the disease early, before any symptoms appear. You train the model on easy-to-obtain features, so screening each individual patient is simple and low-cost. In case the model classifies a patient as sick, she must then be sent for further testing in order to confirm the illness. Assume that these further tests are expensive and involve high risk to the patient. Assume also that once diagnosed, a low-cost treatment exists.
You wish to screen as many people as possible at the lowest possible cost and loss of life. Would you still choose the same "optimal" point on the ROC curve as above? If not, how would you choose it? Answer these questions for two possible scenarios:
Explain your answers.
display_answer(hw2.answers.part3_q3)
Your answer:
In the scenario where a person with the disease will develop non-lethal symptoms that confirm the diagnosis, the focus would be on lowering the False Positive Rate (FPR) to minimize the costs associated with unnecessary follow-up tests, while accepting a somewhat higher False Negative Rate (FNR), given the low-risk nature of the disease and the eventual appearance of detectable symptoms. Therefore, the preferred point on the ROC curve would combine a low FPR with a relatively low TPR (1-FNR), i.e. an operating point closer to (FPR, TPR) = (0, 0).
In the scenario where a person with the disease shows no clear symptoms and faces a high probability of death without early diagnosis, the focus shifts to saving lives. Here, it becomes essential to prioritize a low FNR to identify those individuals at risk and provide timely intervention. Even if it incurs additional costs (higher FPR), the primary objective is to minimize the loss of life. Consequently, the preferred point on the ROC curve would have a low FNR, such as the point (FPR, TPR) = (0, 1).
Analyze your results from the Architecture Experiment:
1. The effect of the width on the results and decision boundaries (fixed depth, width varies).
2. The effect of the depth on the results and decision boundaries (fixed width, depth varies).
3. A comparison of the two configurations depth=1, width=32 and depth=4, width=8.
display_answer(hw2.answers.part3_q4)
Your answer:
4.1.
For the case of depth=1, it was observed that as the width increased, the validation and test accuracies decreased. This trend was also observed for depth=2. However, for depth=4, the validation and test accuracies initially decreased with increasing width, but for the largest width value (32), the accuracy improved and almost reached the level observed with width=2 (87% compared to the initial accuracy of 89%). Overall, the best performance was achieved when the depth was set to 2 and the width was set to 2, resulting in a test accuracy of 91%.
Regarding the decision boundaries, it was observed that for any fixed depth, increasing the width led to more flexible and complex decision boundaries. This suggests that wider models have the ability to capture more intricate patterns and relationships within the data, enabling them to classify samples with higher complexity.
4.2.
Upon analyzing the results for fixed width and varying depths, several patterns emerge. For smaller widths, such as 2 and 8, increasing the depth initially improves the accuracy of the model. However, there is a threshold beyond which increasing the depth no longer benefits the model, and the accuracy starts to decline. This decline is even more pronounced, with the accuracy dropping below the initial values. For instance, with width=2, the test accuracy starts at 90%, increases to 91.5% for depth=2, but then decreases to 89.5% for depth=8. A similar trend is observed for width=8, starting at 87%, increasing to 91%, and then dropping to 84.3%. Interestingly, for the largest width value of 32, increasing the depth leads to a consistent improvement in accuracy. The test accuracy increases from 84.1% for depth=1 to 86.8% for depth=2, and further to 87.8% for depth=8. This suggests that deeper networks are more effective in capturing the complexity of the data when the width is larger.
In terms of decision boundaries, the effect of increasing the depth is more pronounced for smaller width values (2 and 8). In these cases, increasing the depth results in more flexible and complex decision boundaries. On the other hand, for width=32, there is less noticeable difference in the decision boundaries as the depth increases, indicating that the initial depth may already capture the complexity of the data fairly well.
These observations align with the understanding that increasing the depth of an MLP can enhance its representational capacity and ability to learn intricate patterns in the data. However, there is a trade-off, as excessively deep networks may suffer from issues like vanishing gradients or overfitting. Therefore, striking the right balance between depth and width is crucial to achieve optimal performance and decision boundary complexity.
4.3.
When comparing the results for configurations with the same number of total parameters, namely depth=1 and width=32 (referred to as Configuration A), and depth=4 and width=8 (referred to as Configuration B), several observations can be made. Firstly, in Configuration A, we observe a higher threshold value of 0.32 compared to 0.28 in Configuration B. The threshold value indicates the decision boundary of the model, and a higher threshold suggests a more conservative classification approach. Secondly, when considering the validation and test accuracies, Configuration B performs slightly better than Configuration A. The validation accuracy for Configuration B is 82.7%, while for Configuration A it is 80.6%. Similarly, the test accuracy for Configuration B is 84.3%, whereas for Configuration A it is 84.1%. These results indicate that Configuration B achieves slightly higher accuracy on unseen data. In terms of decision boundaries, both configurations exhibit similar patterns. The decision boundaries in both cases appear to be comparable, suggesting that the models have similar capabilities in separating and classifying different data points. Overall, when comparing Configuration A (depth=1, width=32) and Configuration B (depth=4, width=8), Configuration B shows slightly better performance in terms of validation and test accuracies while maintaining similar decision boundary patterns. This implies that increasing the depth to 4 and reducing the width to 8 leads to improved model performance in this scenario.
4.4.
The effect of threshold selection on the validation set and its impact on the test set can be complex and dependent on various factors. In our case, we can see that in 5 out of 9 instances, the test accuracy is higher than the validation accuracy after applying the chosen threshold. This suggests that, to some extent, the threshold selection improved the results on the test set. However, it is important to consider the general principles and potential limitations associated with threshold selection. In some cases, the optimal threshold determined based on the validation set does not necessarily lead to improved results on the test set. The reason for this lies in the sensitivity of the optimal threshold to the specific dataset used for validation. Although the validation set is intended to represent an independent distribution and provide an estimate of model performance, it does not guarantee identical results on the test set. Differences in the distributions and characteristics of the validation and test sets can lead to variations in the optimal threshold. Therefore, selecting the optimal threshold solely based on the validation set may not always yield the best performance on the test set.
In this part we will explore convolutional networks. We'll implement a common block-based deep CNN pattern, with and without residual connections.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Convolutional layers are the most essential building blocks of state-of-the-art deep learning image classification models, and also play an important role in many other tasks. As we saw in the tutorial, when applied to images, convolutional layers operate on and produce volumes (3D tensors) of activations.
A convenient way to interpret convolutional layers for images is as a collection of 3D learnable filters, each of which operates on a small spatial region of the input volume. Each filter is convolved with the input volume ("slides over it"), and a dot product is computed at each location followed by a non-linearity which produces one activation. All these activations produce a 2D plane known as a feature map. Multiple feature maps (one for each filter) comprise the output volume.
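The spatial size of each feature map follows the standard formula H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1 per dimension; a quick sanity check against nn.Conv2d (the specific kernel/stride/padding values here are arbitrary):

```python
import torch
from torch import nn

def conv_out_size(h, kernel, stride=1, padding=0):
    # Standard output-size formula for a single spatial dimension.
    return (h + 2 * padding - kernel) // stride + 1

conv = nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=3)
x = torch.zeros(1, 3, 100, 100)
h = conv_out_size(100, kernel=5, stride=2, padding=3)
print(conv(x).shape, h)  # spatial size 51 both ways
```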
A crucial property of convolutional layers is their translation equivariance, i.e. shifting the input results in an equivalently shifted output. This gives the network the ability to detect features regardless of their spatial location in the input.
Convolutional network architectures usually follow a pattern of basic repeating blocks: one or more convolution layers, each followed by a non-linearity (generally ReLU), and then a pooling layer to reduce spatial dimensions. Usually, the number of convolutional filters increases the deeper they are in the network. These layers are meant to extract features from the input. Then, one or more fully-connected layers are used to combine the extracted features into the required number of output class scores.
PyTorch provides all the basic building blocks needed for creating a convolutional architecture within the torch.nn package.
Let's use them to create a basic convolutional network with the following architecture pattern:
[(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
Here $N$ is the total number of convolutional layers, $P$ specifies how many convolutions to perform before each pooling layer and $M$ specifies the number of hidden fully-connected layers before the final output layer.
TODO: Complete the implementation of the CNN class in the hw2/cnn.py module.
Use PyTorch's nn.Conv2d and nn.MaxPool2d for the convolution and pooling layers.
It's recommended to implement the missing functionality in the order of the class' methods.
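As a rough sketch of how the repeating [(CONV -> ACT)*P -> POOL] pattern can be assembled (fixed 3x3 convolutions, ReLU and 2x2 max-pooling are assumed here for brevity; the actual CNN class makes the conv, activation and pooling parameters configurable):

```python
import torch
from torch import nn

def make_feature_extractor(in_channels, channels, pool_every):
    """Build [(CONV -> ReLU)*P -> MaxPool]*(N/P) with 3x3 'same' convolutions."""
    layers = []
    prev = in_channels
    for i, ch in enumerate(channels):
        layers += [nn.Conv2d(prev, ch, kernel_size=3, padding=1), nn.ReLU()]
        prev = ch
        if (i + 1) % pool_every == 0:       # pool after every P convolutions
            layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# Usage: N=4 convolutions, P=2 -> two pooling layers: 100 -> 50 -> 25
fx = make_feature_extractor(3, [32, 32, 64, 64], pool_every=2)
out = fx(torch.zeros(1, 3, 100, 100))
print(out.shape)
```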
from hw2.cnn import CNN
test_params = [
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=3, stride=1, padding=1),
activation_type='relu', activation_params=dict(),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=5, stride=2, padding=3),
activation_type='lrelu', activation_params=dict(negative_slope=0.05),
pooling_type='avg', pooling_params=dict(kernel_size=3),
),
dict(
in_size=(3,100,100), out_classes=3,
channels=[16]*5, pool_every=3, hidden_dims=[100]*1,
conv_params=dict(kernel_size=2, stride=2, padding=2),
activation_type='lrelu', activation_params=dict(negative_slope=0.1),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
]
for i, params in enumerate(test_params):
torch.manual_seed(seed)
net = CNN(**params)
print(f"\n=== test {i=} ===")
print(net)
torch.manual_seed(seed)
test_out = net(torch.ones(1, 3, 100, 100))
print(f'{test_out=}')
expected_out = torch.load(f'tests/assets/expected_conv_out_{i:02d}.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
(2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU()
(7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=20000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=100, bias=True)
(3): ReLU()
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0745, -0.1058, 0.0928, 0.0476, 0.0057, 0.0051, 0.0938, -0.0582,
0.0573, 0.0583]], grad_fn=<AddmmBackward0>)
max_diff=7.450580596923828e-09
=== test i=1 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(1): LeakyReLU(negative_slope=0.05)
(2): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(3): LeakyReLU(negative_slope=0.05)
(4): AvgPool2d(kernel_size=3, stride=3, padding=0)
(5): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(6): LeakyReLU(negative_slope=0.05)
(7): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(8): LeakyReLU(negative_slope=0.05)
(9): AvgPool2d(kernel_size=3, stride=3, padding=0)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=32, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.05)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.05)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0724, -0.0030, 0.0637, -0.0073, 0.0932, -0.0662, -0.0656, 0.0076,
0.0193, 0.0241]], grad_fn=<AddmmBackward0>)
max_diff=0.0
=== test i=2 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(1): LeakyReLU(negative_slope=0.1)
(2): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(3): LeakyReLU(negative_slope=0.1)
(4): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(5): LeakyReLU(negative_slope=0.1)
(6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(7): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(8): LeakyReLU(negative_slope=0.1)
(9): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(10): LeakyReLU(negative_slope=0.1)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=400, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.1)
(2): Linear(in_features=100, out_features=3, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[-0.0004, -0.0094, 0.0817]], grad_fn=<AddmmBackward0>)
max_diff=0.0
As before, we'll wrap our model with a Classifier that provides the necessary functionality for calculating probability scores and obtaining class label predictions.
This time, we'll use a simple approach: select the class with the highest score.
TODO: Implement the ArgMaxClassifier in the hw2/classifier.py module.
from hw2.classifier import ArgMaxClassifier
model = ArgMaxClassifier(model=CNN(**test_params[0]))
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test.assertEqual(model.classify(test_image).shape, (1,))
test.assertEqual(model.predict_proba(test_image).shape, (1, 10))
test.assertAlmostEqual(torch.sum(model.predict_proba(test_image)).item(), 1.0, delta=1e-3)
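The intended behavior can be sketched independently of our implementation. A minimal example, assuming (as the tests above suggest) that the classifier derives probabilities via a softmax over the model's raw class scores and picks the argmax as the label:

```python
import torch

# Hypothetical, self-contained sketch of the argmax idea (not the actual ArgMaxClassifier):
scores = torch.tensor([[1.5, -0.2, 0.7]])   # raw class scores for one sample, shape (1, 3)
proba = torch.softmax(scores, dim=1)        # probability scores; each row sums to 1
label = torch.argmax(proba, dim=1)          # predicted class = highest-scoring class
print(label.item())  # 0
```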
Let's now load CIFAR-10 to use as our dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
x0,_ = ds_train[0]
in_size = x0.shape
num_classes = 10
print('input image size =', in_size)
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples
Test: 10000 samples
input image size = torch.Size([3, 32, 32])
As usual, let's run a sanity check: make sure we can overfit a tiny dataset with our model. But first, we need to adapt our Trainer to PyTorch models.
TODO:
Implement the ClassifierTrainer class in the hw2/training.py module if you haven't done so already.
Implement the loss function and optimization hyperparameters in part4_optim_hp() in hw2/answers.py.
from hw2.training import ClassifierTrainer
from hw2.answers import part4_optim_hp
torch.manual_seed(seed)
# Define a tiny part of the CIFAR-10 dataset to overfit it
batch_size = 2
max_batches = 25
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Create model, loss and optimizer instances
model = ArgMaxClassifier(
model=CNN(
in_size, num_classes, channels=[32], pool_every=1, hidden_dims=[100],
conv_params=dict(kernel_size=3, stride=1, padding=1),
pooling_params=dict(kernel_size=2),
)
)
hp_optim = part4_optim_hp()
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
# Use ClassifierTrainer to run only the training loop a few times.
trainer = ClassifierTrainer(model, loss_fn, optimizer, device)
best_acc = 0
for i in range(25):
    res = trainer.train_epoch(dl_train, max_batches=max_batches, verbose=(i % 5 == 0))
    best_acc = res.accuracy if res.accuracy > best_acc else best_acc
# Test overfitting
test.assertGreaterEqual(best_acc, 90)
train_batch: 0%| | 0/25 [00:00<?, ?it/s]
A very common addition to the basic convolutional architecture described above are shortcut connections. First proposed by He et al. (2016), this simple addition has been shown to be a crucial ingredient in order to achieve effective learning with very deep networks. Virtually all state of the art image classification models from recent years use this technique.
The idea is to add a shortcut, or skip, around every two or more convolutional layers:
On the left we see an example of a regular residual block, which takes a 64-channel input and performs two 3x3 convolutions, whose result is added to the original input.
On the right we see an example of a bottleneck residual block, which takes a 256-channel input, projects it down to a 64-channel tensor with a 1x1 convolution, performs an inner 3x3 convolution, and then projects back up with another 1x1 convolution to the original number of channels, 256. The output is then added to the original input.
Overall, we can denote the structure of the bottleneck channels in this example as 256->64->64->256, where the first and last arrows denote the 1x1 convolutions and the middle arrow is the inner 3x3 convolution. Note that a 1x1 convolution with the default parameters (in PyTorch) is defined such that the only dimension of the tensor that changes is the number of channels.
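To illustrate the last point, here is a quick check that a 1x1 convolution with PyTorch's default stride and padding changes only the channel dimension (channel counts taken from the diagram above):

```python
import torch
import torch.nn as nn

proj = nn.Conv2d(256, 64, kernel_size=1)  # 1x1 projection: 256 -> 64 channels
x = torch.randn(1, 256, 32, 32)
y = proj(x)
print(y.shape)  # torch.Size([1, 64, 32, 32]) -- spatial dims unchanged
```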
This gives the network an easy way to learn an identity mapping: simply drive the weight values toward zero. The outcome is that the convolutional layers learn a residual mapping, i.e. some delta that is applied to the identity, instead of having to learn a completely new mapping from scratch.
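The skip connection itself boils down to a single addition in the forward pass. A minimal sketch of the idea (assuming equal input and output channels, so no projection is needed on the shortcut; this is not the required ResidualBlock API):

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Main path: two 3x3 convs; 'same' padding keeps the spatial size.
        self.main_path = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # If the conv weights are near zero, the block reduces to ReLU(x):
        # the convolutions only learn a residual correction to the identity.
        return torch.relu(self.main_path(x) + x)

block = TinyResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```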
Let's start by implementing a general residual block, representing a structure similar to the above diagrams. Our residual block will be composed of:
A main path of convolutional layers.
A shortcut path which, when the input and output channel dimensions differ, includes a 1x1 convolution to project the channel dimension.
TODO: Complete the implementation of the ResidualBlock's __init__() method in the hw2/cnn.py module.
from hw2.cnn import ResidualBlock
torch.manual_seed(seed)
resblock = ResidualBlock(
in_channels=3, channels=[6, 4]*2, kernel_sizes=[3, 5]*2,
batchnorm=True, dropout=0.2
)
print(resblock)
torch.manual_seed(seed)
test_out = resblock(torch.ones(1, 3, 32, 32))
print(f'{test_out.shape=}')
expected_out = torch.load('tests/assets/expected_resblock_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(5): Dropout2d(p=0.2, inplace=False)
(6): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): ReLU()
(8): Conv2d(4, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.2, inplace=False)
(10): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
(shortcut_path): Sequential(
(0): Identity()
(1): Conv2d(3, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
test_out.shape=torch.Size([1, 4, 32, 32])
max_diff=5.960464477539062e-07
In the ResNet Block diagram shown above, the right block is called a bottleneck block. This type of block is mainly used deep in the network, where the feature space becomes increasingly high-dimensional (i.e. there are many channels).
Instead of applying a KxK conv layer on the original input channels, a bottleneck block first projects to a lower number of features (channels), applies the KxK conv on the result, and then projects back to the original feature space. Both projections are performed with 1x1 convolutions.
TODO: Complete the implementation of the ResidualBottleneckBlock in the hw2/cnn.py module.
from hw2.cnn import ResidualBottleneckBlock
torch.manual_seed(seed)
resblock_bn = ResidualBottleneckBlock(
in_out_channels=256, inner_channels=[64, 32, 64], inner_kernel_sizes=[3, 5, 3],
batchnorm=False, dropout=0.1, activation_type="lrelu"
)
print(resblock_bn)
# Test a forward pass
torch.manual_seed(seed)
test_in = torch.ones(1, 256, 32, 32)
test_out = resblock_bn(test_in)
print(f'{test_out.shape=}')
assert test_out.shape == test_in.shape
expected_out = torch.load('tests/assets/expected_resblock_bn_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): LeakyReLU(negative_slope=0.01)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Dropout2d(p=0.1, inplace=False)
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(7): Dropout2d(p=0.1, inplace=False)
(8): LeakyReLU(negative_slope=0.01)
(9): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): Dropout2d(p=0.1, inplace=False)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
test_out.shape=torch.Size([1, 256, 32, 32])
max_diff=1.1920928955078125e-07
Now, based on the ResidualBlock, we'll implement our own variation of a residual network (ResNet),
with the following architecture:
[-> (CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
\------- SKIP ------/
Note that $N$, $P$ and $M$ are as before, however now $P$ also controls the number of convolutional layers to add a skip-connection to.
TODO: Complete the implementation of the ResNet class in the hw2/cnn.py module.
You must use your ResidualBlocks or ResidualBottleneckBlocks to group together every $P$ convolutional layers.
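The grouping logic itself is simple list manipulation; a sketch of one possible approach (`group_blocks` is a hypothetical helper, not part of the required API):

```python
# Hypothetical helper: split the per-layer channels list into groups of at most P,
# one group per residual block (the last group may be shorter than P).
def group_blocks(channels, pool_every):
    P = pool_every
    return [channels[i:i + P] for i in range(0, len(channels), P)]

print(group_blocks([32, 64] * 3, 4))  # [[32, 64, 32, 64], [32, 64]]
```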
from hw2.cnn import ResNet
test_params = [
dict(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
activation_type='lrelu', activation_params=dict(negative_slope=0.01),
pooling_type='avg', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=False
),
dict(
# create 64->16->64 bottlenecks
in_size=(3,100,100), out_classes=5, channels=[64, 16, 64]*4,
pool_every=3, hidden_dims=[64]*1,
activation_type='tanh',
pooling_type='max', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=True
)
]
for i, params in enumerate(test_params):
    torch.manual_seed(seed)
    net = ResNet(**params)
    print(f"\n=== test {i=} ===")
    print(net)
    torch.manual_seed(seed)
    test_out = net(torch.ones(1, 3, 100, 100))
    print(f'{test_out=}')
    expected_out = torch.load(f'tests/assets/expected_resnet_out_{i:02d}.pt')
    print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
    test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.01)
(8): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.1, inplace=False)
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
(1): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(2): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=160000, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0422, 0.0332, 0.1870, -0.0532, -0.0742, 0.1143, -0.0617, -0.0467,
0.0852, 0.0221]], grad_fn=<AddmmBackward0>)
max_diff=8.195638656616211e-08
=== test i=1 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
(1): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(2): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential(
(0): Identity()
)
)
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(layers): Sequential(
(0): Linear(in_features=2304, out_features=64, bias=True)
(1): Tanh()
(2): Linear(in_features=64, out_features=5, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[ 0.0237, -0.1945, -0.0085, -0.4024, -0.2667]],
grad_fn=<AddmmBackward0>)
max_diff=2.384185791015625e-07
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Consider the bottleneck block from the right side of the ResNet diagram above. Compare it to a regular block that performs two 3x3 convolutions directly on the 256-channel input (i.e. as shown on the left side of the diagram, but with a different number of channels). Explain the differences between the regular block and the bottleneck block in terms of:
1. The number of parameters.
2. The number of floating-point operations required to compute the output (qualitatively).
3. The ability to combine the input: (a) spatially, within feature maps; (b) across feature maps.
display_answer(hw2.answers.part4_q1)
Your answer:
1.1.
Number of parameters in the regular block: The first convolutional layer has a kernel size of 3x3 and 256 input channels. With an additional bias term, it gives us (3x3x256 + 1) parameters for each of the 256 output channels. Therefore, the total number of parameters for the first convolutional layer is (3x3x256 + 1) x 256. The second convolutional layer also has a kernel size of 3x3 and 256 input channels. Similar to the first layer, it yields (3x3x256 + 1) parameters for each of the 256 output channels. Thus, the total number of parameters for the second convolutional layer is (3x3x256 + 1) x 256. Since there are two convolutional layers in the regular block, we get (3x3x256 + 1) x 256 x 2 = 1,180,160.
Number of parameters in the bottleneck block: The first convolutional layer is a 1x1 convolution that reduces the 256 input channels to 64 output channels. Including the bias term, this gives us (1x1x256 + 1) parameters for each of the 64 output channels. The second convolutional layer has a kernel size of 3x3 and operates on the 64 input channels, yielding (3x3x64 + 1) parameters for each of the 64 output channels. The third and final convolutional layer is another 1x1 convolution that expands the 64 input channels back to 256 output channels, giving (1x1x64 + 1) parameters for each of the 256 output channels. Total number of parameters in the bottleneck block = (1x1x256 + 1) x 64 + (3x3x64 + 1) x 64 + (1x1x64 + 1) x 256 = 70,016.
In terms of the number of parameters, we can observe that the bottleneck block has significantly fewer parameters (70,016) compared to the regular block (1,180,160). This reduction in the number of parameters can lead to faster training and reduced computational requirements.
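The counts above can be verified with a few lines of arithmetic (same layer shapes as in the answer):

```python
def conv_params(c_in, c_out, k):
    # Weights (k*k*c_in per output channel) plus one bias per output channel.
    return (c_in * k * k + 1) * c_out

# Regular block: two 3x3 convs on 256 channels.
regular = 2 * conv_params(256, 256, 3)
# Bottleneck block: 1x1 down-projection, 3x3 conv, 1x1 up-projection.
bottleneck = conv_params(256, 64, 1) + conv_params(64, 64, 3) + conv_params(64, 256, 1)
print(regular, bottleneck)  # 1180160 70016
```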
1.2.
The number of floating-point operations in a convolutional layer is roughly proportional to its number of parameters multiplied by the spatial size of its output map: each parameter takes part in one multiply-accumulate per output location. Since the bottleneck block has far fewer parameters, and its layers produce output maps of the same spatial size, it requires far fewer floating-point operations than the regular block.
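As a rough sanity check of this claim, we can count one multiply-accumulate (MAC) per parameter per output location, assuming stride 1, 'same' padding, and a 32x32 feature map (the map size is an assumption; only the ratio between the blocks matters):

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    # Each output value needs c_in*k*k products plus a bias add.
    return (c_in * k * k + 1) * c_out * h_out * w_out

h = w = 32
regular = 2 * conv_macs(256, 256, 3, h, w)
bottleneck = (conv_macs(256, 64, 1, h, w)
              + conv_macs(64, 64, 3, h, w)
              + conv_macs(64, 256, 1, h, w))
print(regular // bottleneck)  # 16  (about 16x fewer MACs for the bottleneck)
```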
1.3.
The regular block, which consists of two 3x3 convolutions, excels in combining input within feature maps due to its ability to capture spatial relationships and patterns within each layer. With two convolutions, it has a wider receptive field, allowing it to consider more features within the feature map. On the other hand, the bottleneck block utilizes a single 3x3 convolution, limiting its ability to capture fine-grained details within the feature map.
In terms of combining input across feature maps, both the regular and bottleneck blocks demonstrate similar capabilities. Both blocks maintain the same number of input and output channels for each convolution layer, enabling them to combine information across the feature maps effectively. The bottleneck block, however, benefits from transitioning between a higher number of channels (e.g., from 256 to 64), facilitating the integration of high-level and low-level information across feature maps.
In this part we will explore convolution networks and the effects of their architecture on accuracy. We'll use our deep CNN implementation and perform various experiments on it while varying the architecture. Then we'll implement our own custom architecture to see whether we can get high classification results on a large subset of CIFAR-10.
Training will be performed on GPU.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
We will now perform a series of experiments that train various model configurations on a part of the CIFAR-10 dataset.
To perform the experiments, you'll need to use a machine with a GPU since training time might be too long otherwise.
Here's an example of running a forward pass on the GPU (assuming you're running this notebook on a GPU-enabled machine).
from hw2.cnn import ResNet
net = ResNet(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
pooling_type='avg', pooling_params=dict(kernel_size=2),
)
net = net.to(device)
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_image = test_image.to(device)
test_out = net(test_image)
Notice how we called .to(device) on both the model and the input tensor.
Here the device is a torch.device object that we created above. If an nvidia GPU is available on the machine you're running this on, the device will be 'cuda'. When you run .to(device) on a model, it recursively goes over all the model parameter tensors and copies their memory to the GPU. Similarly, calling .to(device) on the input image also copies it.
In order to train on a GPU, you need to make sure to move all your tensors to it. You'll get errors if you try to mix CPU and GPU tensors in a computation.
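For example (a small demonstration of the rule above; on a CPU-only machine both tensors simply stay on the CPU and the computation still succeeds):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(4, 2).to(device)  # parameter tensors moved to `device`
x = torch.randn(1, 4).to(device)          # the input must be moved as well
y = model(x)                              # OK: everything lives on one device
print(y.device)
# Had `x` stayed on the CPU while the model was on the GPU, the forward pass
# would raise a RuntimeError about tensors being on different devices.
```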
print(f'This notebook is running with device={device}')
print(f'The model parameter tensors are also on device={next(net.parameters()).device}')
print(f'The test image is also on device={test_image.device}')
print(f'The output is therefore also on device={test_out.device}')
This notebook is running with device=cpu
The model parameter tensors are also on device=cpu
The test image is also on device=cpu
The output is therefore also on device=cpu
First, please read the course servers guide carefully.
To run the experiments on the course servers, you can use the py-sbatch.sh script directly to perform a single experiment run in batch mode (since it runs python once), or use the srun command to do a single run in interactive mode. For example, running a single run of experiment 1 interactively (after conda activate of course):
srun -c 2 --gres=gpu:1 --pty python -m hw2.experiments run-exp -n test -K 32 64 -L 2 -P 2 -H 100
To perform multiple runs in batch mode with sbatch (e.g. for running all the configurations of an experiments), you can create your own script based on py-sbatch.sh and invoke whatever commands you need within it.
Don't request more than 2 CPU cores and 1 GPU device for your runs. The code won't be able to utilize more than that anyway, so you'll see no performance gain if you do. It will only cause delays for other students using the servers.
Copy the results folder to your local machine.
This notebook will only display the results, not run the actual experiment code (except for a demo run).
Each run accepts a run_name parameter that will also be the base name of the results file which this notebook will expect to load.
Implement the experiments in the hw2/experiments.py module. This module has a CLI parser so that you can invoke it as a script and pass in all the configuration parameters for a single experiment run.
Use python -m hw2.experiments run-exp to run an experiment, and not python hw2/experiments.py run-exp, regardless of how/where you run it.
In this part we will test some different architecture configurations based on our CNN and ResNet.
Specifically, we want to try different depths and number of features to see the effects these parameters have on the model's performance.
To do this, we'll define two extra hyperparameters for our model, K (filters_per_layer) and L (layers_per_block).
K is a list containing the number of filters we want to have in our conv layers.
L is the number of consecutive layers with the same number of filters to use.
For example, if K=[32, 64] and L=2, we want two conv layers with 32 filters followed by two conv layers with 64 filters. If we also use pool_every=3, the feature-extraction part of our model will be:
Conv(X,32)->ReLU->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU
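Expanding K and L into the per-layer channels list our CNN expects is a one-liner; a sketch (`expand_channels` is a hypothetical helper name):

```python
# Hypothetical helper: repeat each filter count in K for L consecutive layers.
def expand_channels(filters_per_layer, layers_per_block):
    return [k for k in filters_per_layer for _ in range(layers_per_block)]

print(expand_channels([32, 64], 2))  # [32, 32, 64, 64]
```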
We'll try various values of the K and L parameters in combination and see how each architecture trains. All other hyperparameters are up to you, including the choice of the optimization algorithm, the learning rate, regularization and architecture hyperparams such as pool_every and hidden_dims. Note that you should select the pool_every parameter wisely per experiment so that you don't end up with zero-width feature maps.
You can try some short manual runs to determine some good values for the hyperparameters or implement cross-validation to do it. However, the dataset size you test on should be large. If you limit the number of batches, make sure to use at least 30000 training images and 5000 validation images.
The important thing is that you state what you used, how you decided on it, and explain your results based on that.
First we need to write some code to run the experiment.
TODO:
Implement the cnn_experiment() function in the hw2/experiments.py module, using your Trainer class.
The following block tests that your implementation works. It's also meant to show you that each experiment run creates a result file containing the parameters needed to reproduce it and the FitResult object for plotting.
from hw2.experiments import load_experiment, cnn_experiment
from cs236781.plot import plot_fit
# Test experiment1 implementation on a few data samples and with a small model
cnn_experiment(
'test_run', seed=seed, bs_train=50, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64], layers_per_block=1, pool_every=1, hidden_dims=[100],
model_type='resnet',
)
# There should now be a file 'test_run.json' in your `results/` folder.
# We can use it to load the results of the experiment.
cfg, fit_res = load_experiment('results/test_run_L1_K32-64.json')
_, _ = plot_fit(fit_res, train_test_overlay=True)
# And `cfg` contains the exact parameters to reproduce it
print('experiment config: ', cfg)
Files already downloaded and verified
Files already downloaded and verified
--- EPOCH 1/10 ---
train_batch: 0%| | 0/10 [00:00<?, ?it/s]
test_batch: 0%| | 0/2 [00:00<?, ?it/s]
(identical progress output repeats for epochs 2-10)
*** Output file ./results/test_run_L1_K32-64.json written
experiment config: {'run_name': 'test_run', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'layers_per_block': 1, 'pool_every': 1, 'hidden_dims': [100], 'model_type': 'resnet', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}, 'filters_per_layer': [32, 64]}
We'll use the following function to load multiple experiment results and plot them together.
def plot_exp_results(filename_pattern, results_dir='results'):
    fig = None
    result_files = glob.glob(os.path.join(results_dir, filename_pattern))
    result_files.sort()
    if len(result_files) == 0:
        print(f'No results found for pattern {filename_pattern}.', file=sys.stderr)
        return
    for filepath in result_files:
        # Use a raw string for the regex so '\d' is not treated as a string escape
        m = re.match(r'exp\d_(\d_)?(.*)\.json', os.path.basename(filepath))
        cfg, fit_res = load_experiment(filepath)
        fig, axes = plot_fit(fit_res, fig, legend=m[2], log_loss=True)
    del cfg['filters_per_layer']
    del cfg['layers_per_block']
    print('common config: ', cfg)
Experiment 1.1: Network depth (L)
First, we'll test the effect of the network depth on training.
Configurations:
K=32 fixed, with L=2,4,8,16 varying per run
K=64 fixed, with L=2,4,8,16 varying per run
So 8 different runs in total.
Naming runs:
Each run should be named exp1_1_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_1_L2_K32.
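The full set of expected run names can be generated programmatically, e.g.:

```python
# Run names for experiment 1.1: two K values, four L values => 8 runs.
run_names = [f"exp1_1_L{L}_K{K}" for K in (32, 64) for L in (2, 4, 8, 16)]
print(run_names[0], len(run_names))  # exp1_1_L2_K32 8
```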
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_1_L*_K32*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 1766847654, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.003, 'pool_every': 2, 'hidden_dims': [512], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
plot_exp_results('exp1_1_L*_K64*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 775983370, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.003, 'pool_every': 2, 'hidden_dims': [512], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
Experiment 1.2: Number of filters per layer (K)
Now we'll test the effect of the number of convolutional filters in each layer.
Configurations:
L=2 fixed, with K=[32],[64],[128] varying per run.
L=4 fixed, with K=[32],[64],[128] varying per run.
L=8 fixed, with K=[32],[64],[128] varying per run.
So 9 different runs in total. To clarify, in each run K takes the value of a list with a single element.
Naming runs:
Each run should be named exp1_2_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_2_L2_K32.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_2_L2*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 1033039586, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.003, 'pool_every': 1, 'hidden_dims': [512], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
plot_exp_results('exp1_2_L4*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 1123547418, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.003, 'pool_every': 1, 'hidden_dims': [512], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
plot_exp_results('exp1_2_L8*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 155155554, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.03, 'pool_every': 9, 'hidden_dims': [256, 256, 256, 256], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
Experiment 1.3: Number of filters per layer (K) and network depth (L)
Now we'll test the combined effect of the number of convolutional filters and the network depth.
Configurations:
- K=[64, 128] fixed, with L=2,3,4 varying per run.
So 3 different runs in total. To clarify, in each run K takes the value of a list with two elements.
Naming runs:
Each run should be named exp1_3_L{}_K{}-{} where the braces are placeholders for the values. For example, the first run should be named exp1_3_L2_K64-128.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_3*.json')
common config: {'run_name': 'exp1_3', 'out_dir': './results', 'seed': 1210900822, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 2, 'hidden_dims': [128], 'model_type': 'cnn', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
Experiment 1.4: Skip connections (ResNet)
Now we'll test the effect of skip connections on training and performance.
Configurations:
- K=[32] fixed, with L=8,16,32 varying per run.
- K=[64, 128, 256] fixed, with L=2,4,8 varying per run.
So 6 different runs in total.
Naming runs:
Each run should be named exp1_4_L{}_K{}-{}-{} where the braces are placeholders for the values.
TODO: Run the experiment on the above configuration with the ResNet model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_4_L*_K32.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 1779119314, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.004, 'reg': 0.001, 'pool_every': 2, 'hidden_dims': [512], 'model_type': 'resnet', 'stride': 1, 'padding': 1, 'kernel_size': 3, 'dilation': 1, 'pooling_kernel_size': 2, 'pooling_stride': 2, 'lrelu_slope': 0.01, 'activation_type': 'lrelu', 'dropout': 0.1, 'batchnorm': False, 'bottleneck': False, 'amsgrad': False, 'kw': {}}
plot_exp_results('exp1_4_L*_K64*.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 1554481236, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 100, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.003, 'reg': 0.003, 'pool_every': 6, 'hidden_dims': [256, 256, 256, 256], 'model_type': 'resnet', 'kw': {}}
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Analyze your results from experiment 1.1. In particular, what is the effect of depth on accuracy, and were there values of L for which the network wasn't trainable? What causes this? Suggest two things which may be done to resolve it at least partially.
display_answer(hw2.answers.part5_q1)
Your answer:
1.1.
Analyzing the graphs, it is evident that increasing the network depth resulted in a decline in accuracy: the shallowest depths (L=2 and L=4) performed best, with L=4 yielding the best results for both K=32 and K=64. A possible explanation is overfitting: deeper networks have a higher capacity to learn complex representations, but they are also more susceptible to overfitting. A depth of 4 may strike a balance between capturing relevant features and avoiding excessive model complexity. By not going deeper, the network avoids overemphasizing noise or irrelevant patterns, leading to better generalization performance on the dataset.
1.2.
In our experiment, we encountered instances where the network became non-trainable for L=8 and L=16. This can be attributed to the problem of vanishing gradients, observed when the number of layers exceeds a certain threshold (in our case, above 4): the gradients diminish significantly as they propagate backwards, eventually approaching zero, which prevents the model from updating its parameters.

Two potential solutions can partially address this problem. First, batch normalization: by normalizing the input to each layer to zero mean and unit variance, the gradients are allowed to flow more smoothly through the network without vanishing. Second, skip connections, inspired by the ResNet architecture: direct connections that bypass intermediate layers enable gradients to flow directly to lower layers, maintaining a more stable gradient flow during training. This promotes better information propagation and enables the model to learn effectively even with larger values of L.
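The skip-connection idea mentioned above can be sketched as a minimal residual block in PyTorch (a simplified illustration, not the homework's ResNet implementation; channel counts and input sizes below are arbitrary):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = activation(main_path(x) + x).

    The identity path lets gradients flow directly to earlier layers,
    which mitigates the vanishing-gradients problem in deep networks.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.main_path = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.main_path(x) + x)  # skip connection

x = torch.randn(1, 32, 8, 8)
out = ResidualBlock(32)(x)
print(out.shape)  # torch.Size([1, 32, 8, 8]) -- same shape as the input
```

Note that the main path must preserve the spatial dimensions and channel count (here via padding=1 with a 3x3 kernel) so the addition with the identity is well-defined.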
Analyze your results from experiment 1.2. In particular, compare to the results of experiment 1.1.
display_answer(hw2.answers.part5_q2)
Your answer:
Upon analyzing the provided graphs, several observations can be made. First, for a fixed value of L, larger values of K tend to yield higher accuracy in both training and testing. Additionally, comparing L=4 to L=2, L=4 consistently produces superior results. Similar to the findings in experiment 1.1, for L=8 (i.e., L>4) the network becomes non-trainable, resulting in extremely low accuracy for all values of K, which aligns with our previous understanding. Compared to experiment 1.1, where the best performance was achieved with L=4 and K=32, experiment 1.2 demonstrates even better outcomes: we achieve test accuracies surpassing 70% for L=4 with larger values of K such as K=128.
Analyze your results from experiment 1.3.
display_answer(hw2.answers.part5_q3)
Your answer:
Upon analyzing the graphs, it is evident that the model's performance varies with different values of L (depth). Interestingly, for this experiment, L=3, which corresponds to the second lowest depth, yields the highest test accuracy. This suggests that a moderate depth is optimal for achieving better results in terms of accuracy.
Furthermore, a notable observation is that the accuracy drastically drops for L=4. In previous experiments 1.1 and 1.2, such a significant drop in accuracy occurred only for larger values of L, such as L=8. This indicates the presence of the vanishing gradients phenomenon, where the gradients diminish exponentially as they propagate through the deeper layers of the network.
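The "gradients diminish exponentially with depth" claim above can be illustrated with back-of-the-envelope arithmetic: with a saturating activation such as the sigmoid, the derivative sigma'(z) = sigma(z)(1 - sigma(z)) is at most 0.25 (attained at z = 0), so in the worst case the backpropagated gradient picks up a factor of at most 0.25 per layer (a simplified sketch that ignores the weight magnitudes):

```python
# Chaining L sigmoid-derivative factors bounds the gradient scale by 0.25**L,
# so it decays exponentially in the depth L.
for L in (2, 4, 8, 16):
    print(f"L={L:2d}: worst-case gradient scale <= {0.25 ** L:.2e}")
```

Already at L=8 the bound is below 1e-4, which matches the observation that networks deeper than a few layers stop training without batchnorm or skip connections.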
Analyze your results from experiment 1.4. Compare to experiment 1.1 and 1.3.
display_answer(hw2.answers.part5_q4)
Your answer:
Upon analyzing the provided graphs, it becomes evident that shallower depths yield better test accuracies. This aligns with our earlier findings in experiment 1.1, and nearly with those of experiment 1.3, where the second lowest depth was ideal in terms of accuracy. With the ResNet architecture and a fixed K=32, we no longer observe the extremely low accuracies associated with the vanishing-gradients problem, as seen in experiment 1.1 for L>4 and in experiment 1.3 for L=4; the improvement is noticeable for L=8 and L=16, suggesting that the vanishing-gradients issue has been mitigated to some extent. However, for L=32 the phenomenon still persists. Interestingly, for larger values of K (K=64, 128, 256) and across depths L=2, 4, and 8, the skip connections successfully mitigate the vanishing-gradients problem, as previously suggested. Consequently, L=2 yields satisfactory results in terms of accuracy.
In this part we will use an object detection architecture called YOLO (You Only Look Once) to detect objects in images. We'll use pre-trained model weights (YOLOv5) found here: https://github.com/ultralytics/yolov5
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the YOLO model
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
model.to(device)
# Images
img1 = 'imgs/DolphinsInTheSky.jpg'
img2 = 'imgs/cat-shiba-inu-2.jpg'
Using cache found in /home/hay.e/.cache/torch/hub/ultralytics_yolov5_master requirements: Ultralytics requirement "gitpython>=3.1.30" not found, attempting AutoUpdate... requirements: ❌ AutoUpdate skipped (offline) YOLOv5 🚀 2023-5-30 Python-3.8.12 torch-1.10.1 CPU Fusing layers... YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients Adding AutoShape...
You are provided with 2 images (img1 and img2). TODO:
Detect objects using the YOLOv5 model for these 2 images.
Print the inference output with bounding boxes.
Calculate the number of pixels within a bounding box and the number in the background.
Hint: Given that you stored the model output in a variable named 'results', you may find 'results.pandas().xyxy' helpful.
Look at the inference results and answer the question below.
%matplotlib inline
import torch
import cv2
import numpy as np
from matplotlib import pyplot as plt

imgs = [img1, img2]

# Detect objects using the YOLOv5 model for the images
results = model(imgs)

# Print the inference output with bounding boxes
results.print()

# Calculate the number of pixels within a bounding box and in the background
for i, img_path in enumerate(imgs):
    img = cv2.imread(img_path)
    img_height, img_width, _ = img.shape
    df = results.pandas().xyxy[i]
    num_objects = len(df)
    total_pixels = img_height * img_width

    # Sum the area of all detected boxes (note: overlapping boxes are double-counted)
    box_pixels = 0
    for j in range(num_objects):
        xmin = int(df.iloc[j]['xmin'])
        ymin = int(df.iloc[j]['ymin'])
        xmax = int(df.iloc[j]['xmax'])
        ymax = int(df.iloc[j]['ymax'])
        box_pixels += (xmax - xmin) * (ymax - ymin)
    background_pixels = total_pixels - box_pixels

    print(f"Image: {img_path}")
    print("Number of pixels within bounding boxes:", box_pixels)
    print("Number of pixels in the background:", background_pixels)
    print()

    # Plot bounding boxes on the image
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    for j in range(num_objects):
        xmin = int(df.iloc[j]['xmin'])
        ymin = int(df.iloc[j]['ymin'])
        xmax = int(df.iloc[j]['xmax'])
        ymax = int(df.iloc[j]['ymax'])
        label = f"{df.iloc[j]['name']}: {df.iloc[j]['confidence']:.2f}"
        rect = plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, color='red', linewidth=2)
        plt.gca().add_patch(rect)
        plt.text(xmin, ymin, label, color='red', fontsize=8, bbox=dict(facecolor='white', alpha=0.8))
    plt.axis('off')
    plt.show()
image 1/2: 183x275 2 persons, 1 surfboard image 2/2: 750x750 2 cats, 1 dog Speed: 78.1ms pre-process, 91.3ms inference, 6.5ms NMS per image at shape (2, 3, 640, 640)
Image: imgs/DolphinsInTheSky.jpg Number of pixels within bounding boxes: 15213 Number of pixels in the background: 35112
Image: imgs/cat-shiba-inu-2.jpg Number of pixels within bounding boxes: 499089 Number of pixels in the background: 63411
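Note that summing box areas, as above, double-counts pixels where boxes overlap. A mask-based count (a sketch with NumPy and toy boxes, not the actual detection output) counts each covered pixel exactly once:

```python
import numpy as np

def box_union_pixels(boxes, height, width):
    """Count the pixels covered by at least one (xmin, ymin, xmax, ymax) box."""
    mask = np.zeros((height, width), dtype=bool)
    for xmin, ymin, xmax, ymax in boxes:
        mask[ymin:ymax, xmin:xmax] = True  # mark covered pixels; overlaps merge
    return int(mask.sum())

# Two 100-pixel boxes overlapping in a 50-pixel region:
boxes = [(0, 0, 10, 10), (0, 5, 10, 15)]
print(box_union_pixels(boxes, 20, 20))  # 150, whereas naive area summing gives 200
```

The background count is then simply `total_pixels - box_union_pixels(...)`, and it can never go negative, unlike the naive subtraction.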
Analyze the inference results of the 2 images.
from cs236781.answers import display_answer
import hw2.answers
display_answer(hw2.answers.part6_q1)
Your answer:
1.1.
The performance of YOLOv5 on the given images was not satisfactory. In the first image, the model localized the dolphins with some bounding boxes but misclassified them as persons and a surfboard. Occlusion, the lack of distinguishable features, and the dolphins appearing in an unnatural environment posed challenges for accurate detection. Similarly, in the second image containing cats and a dog, the model faced difficulties due to the close proximity of the animals: it failed to create separate bounding boxes for each animal and labeled some of them incorrectly. This could be attributed to the visual similarity between certain dog breeds and cats, leading to model bias, as well as occlusion caused by the overlapping animals.
1.2.
Several factors may have contributed to the poor performance. One potential reason is model bias, where the model's training data might have contained a higher representation of certain classes (e.g., persons on surfboards) compared to others (e.g., flying dolphins). Occlusion, especially in the second image where the cat was partially hidden by the dogs, also hindered accurate detection. Additionally, lighting conditions and the absence of specific classes, such as dolphins, in the trained model could have affected the results. To address these issues, several suggestions can be considered. Firstly, training the model on a dataset with increased variability per class, including various poses and environmental conditions, would help improve its generalization capabilities. Adjusting the size of bounding boxes could aid in better distinguishing closely positioned objects. Modifying the number of bounding boxes per grid cell could enable the model to locate multiple objects within the same area. Finally, fine-tuning the model using a dataset that encompasses a wider range of classes would provide more comprehensive object recognition capabilities.
Object detection pitfalls include, for example: occlusion - when objects are partially occluded and thus missing important features; model bias - when a model learns some bias about an object, it may recognize it as something else in a different setup; and many others, such as deformation, illumination conditions, cluttered or textured backgrounds, and blurring due to moving objects.
TODO: Take pictures that demonstrate 3 of the above object detection pitfalls, run inference, and analyze the results.
# Insert the inference code here.
results = model('imgs/lion-in-bush.jpg')
print('Occlusion: A lion stealthily conceals itself behind a bush - it is detected as a bear')
results.show()

results = model('imgs/grenade-mouse.jpg')
print('Model bias: A grenade placed like a computer mouse - it is detected as a mouse')
results.show()

results = model('imgs/man-bike.png')
print("Deformation: A man's leg looks like a woman's body - 2 person figures are detected")
results.show()
Occlusion: A lion stealthily conceals itself behind a bush - it is detected as a bear
Model bias: A grenade placed like a computer mouse - it is detected as a mouse